Hi Team, can you please let us know the status of this case? Is there any workaround available for this issue?
This is pretty easy to reproduce; I tried the simplest scenario on a single node using local-up-cluster.sh:

# kubectl get nodes
NAME        STATUS   ROLES    AGE   VERSION
127.0.0.1   Ready    <none>   16s   v1.25.0-alpha.0.814+b7337cb171ab12-dirty

1) create a pod with a single pvc (using the csi hostpath driver in this case)
2) kill kubelet
3) force delete the pod
4) restart kubelet and wait for the pod to delete
5) check the mount and kubelet log.

The 'not a directory' errors quickly grew in the kubelet log:

# grep 'not a directory' /tmp/kubelet.log | wc -l
300

Line-by-line walkthrough of the log messages below.

getVolumesFromPodDir:

I0613 22:12:46.095863 81722 reconciler.go:716] "Volume path from volume plugin directory" podName="c17ac5dd-1c61-40db-9e64-6b61103dbd23" volumePath="/var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d"

ReconstructVolume:

I0613 22:12:46.097166 81722 operation_executor.go:1042] Starting operationExecutor.ReconstructVolume for file volume on pod "c17ac5dd-1c61-40db-9e64-6b61103dbd23"

ConstructVolumeSpec -> loadVolumeData:

I0613 22:12:46.097174 81722 csi_plugin.go:515] kubernetes.io/csi: plugin.ConstructVolumeSpec [pv.Name=pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d, path=/var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d]
I0613 22:12:46.097188 81722 csi_util.go:80] kubernetes.io/csi: loading volume data file [/var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d/vol_data.json]
I0613 22:12:46.097240 81722 csi_plugin.go:522] kubernetes.io/csi: plugin.ConstructVolumeSpec extracted [map[string]string{"attachmentID":"csi-b4a39ac1071f586ac52f826c974aaf60747b0decf4812fcef0acb7940e780e94", "driverName":"hostpath.csi.k8s.io", "nodeName":"127.0.0.1", "specVolID":"pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d", "volumeHandle":"fef36ea0-eb60-11ec-91f2-72e3dd2e9bdd", "volumeLifecycleMode":"Persistent"}]

NewMounter -> saveVolumeData:

I0613 22:12:46.097301 81722 csi_mounter.go:87] kubernetes.io/csi: mounter.GetPath generated [/var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d/mount]
I0613 22:12:46.097316 81722 csi_plugin.go:444] kubernetes.io/csi: created path successfully [/var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d]
I0613 22:12:46.097328 81722 csi_util.go:63] kubernetes.io/csi: saving volume data file [/var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d/vol_data.json]
I0613 22:12:46.097390 81722 csi_util.go:72] kubernetes.io/csi: volume data file saved successfully [/var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d/vol_data.json]
I0613 22:12:46.097434 81722 csi_plugin.go:480] kubernetes.io/csi: mounter created successfully
I0613 22:12:46.097446 81722 csi_mounter.go:87] kubernetes.io/csi: mounter.GetPath generated [/var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d/mount]

syncStates: reconstruct fails because the /mount subdir does not exist:

I0613 22:12:46.097494 81722 reconciler.go:400] "Could not construct volume information, cleaning up mounts" podName=c17ac5dd-1c61-40db-9e64-6b61103dbd23 volumeSpecName="pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d" err="could not check whether the volume \"kubernetes.io/csi/hostpath.csi.k8s.io^fef36ea0-eb60-11ec-91f2-72e3dd2e9bdd\" (spec.Name: \"pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d\") pod \"c17ac5dd-1c61-40db-9e64-6b61103dbd23\" (UID: \"c17ac5dd-1c61-40db-9e64-6b61103dbd23\") is mounted with: stat /var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d/mount: no such file or directory"

cleanupMounts:

I0613 22:12:46.097509 81722 reconciler.go:433] "Reconciler sync states: could not find volume information in desired state, clean up the mount points" podName=c17ac5dd-1c61-40db-9e64-6b61103dbd23 volumeSpecName="pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d"

UnmountVolume -> NewUnmounter:

I0613 22:12:46.097527 81722 csi_plugin.go:486] kubernetes.io/csi: setting up unmounter for [name=pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d, podUID=c17ac5dd-1c61-40db-9e64-6b61103dbd23]

That calls GetPath and loadVolumeData again:

I0613 22:12:46.097569 81722 csi_util.go:80] kubernetes.io/csi: loading volume data file [/var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d/vol_data.json]

TearDownAt: attempt to unmount the volume:

I0613 22:12:46.097749 81722 csi_mounter.go:364] kubernetes.io/csi: Unmounter.TearDownAt(/var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d/mount)
E0613 22:12:46.097789 81722 operation_generator.go:848] UnmountVolume.MarkVolumeMountAsUncertain failed for volume "" (UniqueName: "pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d") pod "c17ac5dd-1c61-40db-9e64-6b61103dbd23" (UID: "c17ac5dd-1c61-40db-9e64-6b61103dbd23") : no volume with the name "pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d" exists in the list of attached volumes

This fails because the CSI driver is not registered yet:

E0613 22:12:46.097832 81722 nestedpendingoperations.go:335] Operation for "{volumeName:pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d podName:c17ac5dd-1c61-40db-9e64-6b61103dbd23 nodeName:}" failed. No retries permitted until 2022-06-13 22:12:46.597812352 +0000 UTC m=+21.850115672 (durationBeforeRetry 500ms). Error: UnmountVolume.TearDown failed for volume "" (UniqueName: "pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d") pod "c17ac5dd-1c61-40db-9e64-6b61103dbd23" (UID: "c17ac5dd-1c61-40db-9e64-6b61103dbd23") : kubernetes.io/csi: Unmounter.TearDownAt failed to get CSI client: driver name hostpath.csi.k8s.io not found in the list of registered CSI drivers

And then we get to cleanupOrphanedPodDirs:

E0613 22:12:47.038491 81722 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"c17ac5dd-1c61-40db-9e64-6b61103dbd23\" found, but error occurred when trying to remove the volumes dir: not a directory" numErrs=1
I0613 22:12:47.038509 81722 kubelet_volumes.go:247] "Orphan pod" err="orphaned pod \"c17ac5dd-1c61-40db-9e64-6b61103dbd23\" found, but error occurred when trying to remove the volumes dir: not a directory"

So the sequence of events as I currently understand it is:

1) kubelet calls ReconstructVolume after restart because getVolumesFromPodDir finds the volume plugin directory.
2) The /mount subdir gets removed, either by cleanupOrphanedPodDirs or by a successful unmount (there is not always a log message for this event).
3) ReconstructVolume reads vol_data.json to construct the volume spec and creates a new mounter object, which writes an update to vol_data.json (saveVolumeData).
4) syncStates fails to reconstruct the volume because the /mount subdir does not exist.
5) The volume is marked as uncertain, and TearDownAt is called in an attempt to unmount it, but this fails because the volume is not in the list of attached volumes.
6) cleanupOrphanedPodDirs attempts to remove the orphaned volumes directory, but fails because vol_data.json still exists: TearDownAt never succeeded, so it never removed the file.

This generally seems like a lack of coordination between the volume reconstruction and orphaned pod cleanup logic. Orphaned pod cleanup depends on reconstruction to remove vol_data.json, but if the /mount subdir is removed first, then reconstruction fails and vol_data.json never gets removed.
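To make step 6 concrete, here is a minimal, self-contained Go sketch of a "directories only" cleanup in the spirit of what the orphaned pod cleanup relies on. This is not the actual kubelet code; the removeDirsOnly helper and the paths are made up for the illustration. The point is simply that a leftover vol_data.json file blocks removal of the whole tree, so the same error is logged on every cleanup pass.

// Illustration only: a simplified stand-in for a "remove directories only"
// cleanup of an orphaned pod's volumes dir. A leftover regular file such as
// vol_data.json aborts the removal, so the directory never goes away.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// removeDirsOnly removes path and everything below it, but refuses to delete
// regular files: any file it encounters aborts the walk with an error.
func removeDirsOnly(path string) error {
	entries, err := os.ReadDir(path)
	if err != nil {
		return err
	}
	for _, e := range entries {
		child := filepath.Join(path, e.Name())
		if !e.IsDir() {
			return fmt.Errorf("%s is not a directory", child)
		}
		if err := removeDirsOnly(child); err != nil {
			return err
		}
	}
	return os.Remove(path)
}

func main() {
	// Recreate the leftover layout from the log above (names are illustrative).
	base, err := os.MkdirTemp("", "orphaned-pod-")
	if err != nil {
		panic(err)
	}
	volDir := filepath.Join(base, "volumes", "kubernetes.io~csi", "pvc-example")
	if err := os.MkdirAll(volDir, 0o755); err != nil {
		panic(err)
	}
	if err := os.WriteFile(filepath.Join(volDir, "vol_data.json"), []byte("{}"), 0o644); err != nil {
		panic(err)
	}

	// Cleanup fails as long as vol_data.json exists; only a successful
	// TearDownAt (which deletes vol_data.json) would unblock it.
	if err := removeDirsOnly(base); err != nil {
		fmt.Println("cleanup failed:", err)
	}
}

Running it prints something like "cleanup failed: .../vol_data.json is not a directory", which matches the flavor of the kubelet_volumes.go errors above.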
I re-ran the test against the latest kubernetes master branch today and can no longer reproduce the issue.

root@ubuntu2204:/workspace/csi-driver-host-path# kubectl get nodes
NAME        STATUS   ROLES    AGE   VERSION
127.0.0.1   Ready    <none>   60m   v1.26.0-alpha.0.21+93054957c13a53-dirty

root@ubuntu2204:/workspace/csi-driver-host-path# kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
csi-hostpath-socat-0    1/1     Running   0          56s
csi-hostpathplugin-0    8/8     Running   0          57s
my-csi-app              1/1     Running   0          21s
snapshot-controller-0   1/1     Running   0          61s

root@ubuntu2204:/workspace/csi-driver-host-path# mount | grep csi
/dev/mapper/ubuntu--vg-ubuntu--lv on /var/lib/kubelet/pods/a08c449e-efbe-4f7a-b3bd-4edf377706ff/volumes/kubernetes.io~csi/pvc-e7c4206e-d47d-4e13-9ca8-c1c983bcd806/mount type ext4 (rw,relatime)

root@ubuntu2204:/workspace/csi-driver-host-path# kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM             STORAGECLASS      REASON   AGE
pvc-e7c4206e-d47d-4e13-9ca8-c1c983bcd806   1Gi        RWO            Delete           Bound    default/csi-pvc   csi-hostpath-sc            34s

After killing kubelet, deleting the pod, and starting kubelet back up, the pod gets properly cleaned up with no 'not a directory' errors in kubelet.log.

root@ubuntu2204:/workspace/csi-driver-host-path# kubectl get pods
NAME                    READY   STATUS    RESTARTS      AGE
csi-hostpath-socat-0    1/1     Running   0             4m52s
csi-hostpathplugin-0    8/8     Running   1 (49s ago)   4m53s
snapshot-controller-0   1/1     Running   0             4m57s

root@ubuntu2204:/workspace/csi-driver-host-path# mount | grep csi
root@ubuntu2204:/workspace/csi-driver-host-path# grep 'not a directory' /tmp/kubelet.log
root@ubuntu2204:/workspace/csi-driver-host-path# ls /var/lib/kubelet/pods/a08c449e-efbe-4f7a-b3bd-4edf377706ff/
ls: cannot access '/var/lib/kubelet/pods/a08c449e-efbe-4f7a-b3bd-4edf377706ff/': No such file or directory

Whatever fixed this behavior, it was between these two commits:

$ git log 93054957c13a53 | head -9
commit 93054957c13a5398a1b0eabb3d9c77f896e46093
Merge: 619f1cf5520 44a59710ea2
Author: Kubernetes Prow Robot <k8s-ci-robot.github.com>
Date:   Wed Aug 17 09:09:11 2022 -0700

    Merge pull request #111859 from omertuc/fixparallelclosure2

    Fix failing `test/e2e/storage/csi_mock_volume.go` test

$ git log b7337cb171ab12 | head -9
commit b7337cb171ab126cb892aacdab2816017a290841
Merge: 4988bfce914 061b8e80494
Author: Kubernetes Prow Robot <k8s-ci-robot.github.com>
Date:   Fri Jun 3 13:32:20 2022 -0700

    Merge pull request #110235 from jsturtevant/fix-new-restricted-pod-tests

    test: RunAsUser causes pods to not start on Windows

And we had a few fixes for volume reconstruction merged in kubernetes 1.25. These are the two main PRs that I recall did significant work in this area; it's likely that one of them already fixed it upstream:

https://github.com/kubernetes/kubernetes/pull/110670
https://github.com/kubernetes/kubernetes/pull/108692
In my last set of experiments, this scenario (i.e. kubelet restart) was fixed on k8s 1.25 / OCP 4.12:

1) create a pod with a single pvc (using the csi hostpath driver in this case)
2) kill kubelet
3) force delete the pod
4) restart kubelet and wait for the pod to delete
5) check the mount and kubelet log.

However, if the node is restarted, and the directory is therefore already unmounted when kubelet starts back up, then we can still get the 'not a directory' errors described in this bug, along with this 'is not mounted' error:

I0818 18:55:02.696639 1821557 reconciler.go:506] "Could not construct volume information, cleaning up mounts" podName=50c19ad6-0a5d-4684-a02e-86c1365479c9 volumeSpecName="pvc-e7c4206e-d47d-4e13-9ca8-c1c983bcd806" err="volume: \"kubernetes.io/csi/hostpath.csi.k8s.io^ac6d14b9-1e54-11ed-bd38-fabf9fe5c695\" is not mounted"

There is an upstream issue for this part of the problem: https://github.com/kubernetes/kubernetes/issues/111933
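For context on the node-reboot variant: a reboot clears all mounts, so whatever mount check the kubelet performs during reconstruction will report the path as no longer mounted. Below is a minimal sketch, not the kubelet's actual implementation (the isMounted helper and the placeholder path are mine), of a mount-table lookup that comes back false after a reboot:

// Illustration only: decide whether a path is currently a mount point by
// scanning /proc/self/mounts for it. After a node reboot the CSI bind mount
// is gone, so a check like this returns false and reconstruction gives up
// with an "is not mounted" error.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// isMounted reports whether target appears as a mount point in /proc/self/mounts.
func isMounted(target string) (bool, error) {
	f, err := os.Open("/proc/self/mounts")
	if err != nil {
		return false, err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Each line: "<source> <mountpoint> <fstype> <options> ...".
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[1] == target {
			return true, nil
		}
	}
	return false, sc.Err()
}

func main() {
	// Placeholder path; substitute a real pod UID and PV name.
	target := "/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount"
	mounted, err := isMounted(target)
	if err != nil {
		fmt.Println("could not read mount table:", err)
		return
	}
	fmt.Println("mounted:", mounted)
}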
And this is the commit that seems to have fixed the first part of this issue (kubelet restart):
https://github.com/kubernetes/kubernetes/commit/daa181d92e85091b1e537d285139168d1ad5f10b

I did not have that fix in my workspace at the time of this test:
https://bugzilla.redhat.com/show_bug.cgi?id=2090974#c10
To be clear about the scope of this bug: the symptoms are excessive "orphaned pod" error messages in the kubelet log, and leftover directories under /var/lib/kubelet/pods that can't be cleaned up automatically after a kubelet (or node) restart. If you're seeing symptoms other than those, they will most likely not be fixed by this bug.
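If it helps with triage, here is a small, hypothetical diagnostic (not part of kubelet or of this fix) that lists pod volume directories under /var/lib/kubelet/pods which still contain a vol_data.json. Note that running pods with CSI volumes legitimately have this file too, so cross-check the listed pod UIDs against pods actually scheduled on the node before treating anything as a leftover.

// Hypothetical diagnostic: list on-disk CSI volume state under the kubelet
// pods directory. Run as root on the affected node.
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
)

func main() {
	root := "/var/lib/kubelet/pods"
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return nil // skip entries we cannot read
		}
		// vol_data.json marks CSI volume state on disk; for orphaned pods it is
		// exactly the file that blocks cleanup.
		if !d.IsDir() && d.Name() == "vol_data.json" {
			fmt.Println("CSI volume state on disk:", filepath.Dir(path))
		}
		return nil
	})
	if err != nil {
		fmt.Println("walk failed:", err)
	}
}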
We no longer track bugs in Bugzilla; this bug has been migrated to Jira: https://issues.redhat.com/browse/OCPBUGS-7345