Bug 2090974 - orphaned pod pod_id found, but error "not a directory" occurred when trying to remove the volumes dir (numErrs=17)
Summary: orphaned pod pod_id found, but error not a directory occurred when trying to...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Jonathan Dobson
QA Contact: Wei Duan
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-05-27 07:08 UTC by Poornima Singour
Modified: 2023-09-22 04:09 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-02-11 00:41:02 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 5972661 0 None None None 2022-08-04 12:53:17 UTC

Comment 2 Poornima Singour 2022-05-30 08:15:39 UTC
Hi team, can you please let us know the status of this case? Is there any workaround available for this issue?

Comment 10 Jonathan Dobson 2022-06-14 22:18:25 UTC
This is pretty easy to reproduce. I tried the simplest scenario on a single node using local-up-cluster.sh:

# kubectl get nodes
NAME        STATUS   ROLES    AGE   VERSION
127.0.0.1   Ready    <none>   16s   v1.25.0-alpha.0.814+b7337cb171ab12-dirty

1) create a pod with a single PVC (using the CSI hostpath driver in this case)
2) kill kubelet
3) force delete the pod
4) restart kubelet and wait for the pod to delete
5) check the mounts and the kubelet log
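
For reference, the steps above roughly translate to the following (the manifest paths and the my-csi-app pod name are taken from the csi-driver-host-path examples; adjust for your environment):

$ kubectl apply -f examples/csi-pvc.yaml -f examples/csi-app.yaml    # 1) pod with a single PVC
$ pkill -9 kubelet                                                   # 2) kill kubelet
$ kubectl delete pod my-csi-app --force --grace-period=0             # 3) force delete the pod
# 4) restart kubelet (re-run local-up-cluster.sh) and wait for the pod to delete
$ mount | grep csi                                                   # 5) check the mounts...
$ grep 'not a directory' /tmp/kubelet.log | wc -l                    #    ...and the kubelet log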

The 'not a directory' errors accumulated quickly in the kubelet log:

# grep 'not a directory' /tmp/kubelet.log | wc -l
300

A line-by-line walkthrough of the log messages follows.

getVolumesFromPodDir:

I0613 22:12:46.095863   81722 reconciler.go:716] "Volume path from volume plugin directory" podName="c17ac5dd-1c61-40db-9e64-6b61103dbd23" volumePath="/var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d"

ReconstructVolume:   

I0613 22:12:46.097166   81722 operation_executor.go:1042] Starting operationExecutor.ReconstructVolume for file volume on pod "c17ac5dd-1c61-40db-9e64-6b61103dbd23"

ConstructVolumeSpec -> loadVolumeData:

I0613 22:12:46.097174   81722 csi_plugin.go:515] kubernetes.io/csi: plugin.ConstructVolumeSpec [pv.Name=pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d, path=/var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d]
I0613 22:12:46.097188   81722 csi_util.go:80] kubernetes.io/csi: loading volume data file [/var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d/vol_data.json]
I0613 22:12:46.097240   81722 csi_plugin.go:522] kubernetes.io/csi: plugin.ConstructVolumeSpec extracted [map[string]string{"attachmentID":"csi-b4a39ac1071f586ac52f826c974aaf60747b0decf4812fcef0acb7940e780e94", "driverName":"hostpath.csi.k8s.io", "nodeName":"127.0.0.1", "specVolID":"pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d", "volumeHandle":"fef36ea0-eb60-11ec-91f2-72e3dd2e9bdd", "volumeLifecycleMode":"Persistent"}]

NewMounter -> saveVolumeData:

I0613 22:12:46.097301   81722 csi_mounter.go:87] kubernetes.io/csi: mounter.GetPath generated [/var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d/mount]
I0613 22:12:46.097316   81722 csi_plugin.go:444] kubernetes.io/csi: created path successfully [/var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d]
I0613 22:12:46.097328   81722 csi_util.go:63] kubernetes.io/csi: saving volume data file [/var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d/vol_data.json]
I0613 22:12:46.097390   81722 csi_util.go:72] kubernetes.io/csi: volume data file saved successfully [/var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d/vol_data.json]
I0613 22:12:46.097434   81722 csi_plugin.go:480] kubernetes.io/csi: mounter created successfully
I0613 22:12:46.097446   81722 csi_mounter.go:87] kubernetes.io/csi: mounter.GetPath generated [/var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d/mount]

syncStates: reconstruction fails because the /mount subdir does not exist:

I0613 22:12:46.097494   81722 reconciler.go:400] "Could not construct volume information, cleaning up mounts" podName=c17ac5dd-1c61-40db-9e64-6b61103dbd23 volumeSpecName="pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d" err="could not check whether the volume \"kubernetes.io/csi/hostpath.csi.k8s.io^fef36ea0-eb60-11ec-91f2-72e3dd2e9bdd\" (spec.Name: \"pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d\") pod \"c17ac5dd-1c61-40db-9e64-6b61103dbd23\" (UID: \"c17ac5dd-1c61-40db-9e64-6b61103dbd23\") is mounted with: stat /var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d/mount: no such file or directory"
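
The stat failure is easy to confirm by hand at this point: the parent volume directory still exists (containing the vol_data.json that was just rewritten above), but the /mount subdir is gone:

$ ls /var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d/
(only vol_data.json is listed; there is no mount subdir)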

cleanupMounts:

I0613 22:12:46.097509   81722 reconciler.go:433] "Reconciler sync states: could not find volume information in desired state, clean up the mount points" podName=c17ac5dd-1c61-40db-9e64-6b61103dbd23 volumeSpecName="pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d"

UnmountVolume -> NewUnmounter:

I0613 22:12:46.097527   81722 csi_plugin.go:486] kubernetes.io/csi: setting up unmounter for [name=pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d, podUID=c17ac5dd-1c61-40db-9e64-6b61103dbd23]

That calls GetPath and loadVolumeData again:

I0613 22:12:46.097569   81722 csi_util.go:80] kubernetes.io/csi: loading volume data file [/var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d/vol_data.json]

TearDownAt: attempt to unmount the volume:

I0613 22:12:46.097749   81722 csi_mounter.go:364] kubernetes.io/csi: Unmounter.TearDownAt(/var/lib/kubelet/pods/c17ac5dd-1c61-40db-9e64-6b61103dbd23/volumes/kubernetes.io~csi/pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d/mount)
E0613 22:12:46.097789   81722 operation_generator.go:848] UnmountVolume.MarkVolumeMountAsUncertain failed for volume "" (UniqueName: "pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d") pod "c17ac5dd-1c61-40db-9e64-6b61103dbd23" (UID: "c17ac5dd-1c61-40db-9e64-6b61103dbd23") : no volume with the name "pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d" exists in the list of attached volumes

This fails because the CSI driver is not registered yet:

E0613 22:12:46.097832   81722 nestedpendingoperations.go:335] Operation for "{volumeName:pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d podName:c17ac5dd-1c61-40db-9e64-6b61103dbd23 nodeName:}" failed. No retries permitted until 2022-06-13 22:12:46.597812352 +0000 UTC m=+21.850115672 (durationBeforeRetry 500ms). Error: UnmountVolume.TearDown failed for volume "" (UniqueName: "pvc-d4c16eb5-a9b1-444e-ab14-51fb1011a96d") pod "c17ac5dd-1c61-40db-9e64-6b61103dbd23" (UID: "c17ac5dd-1c61-40db-9e64-6b61103dbd23") : kubernetes.io/csi: Unmounter.TearDownAt failed to get CSI client: driver name hostpath.csi.k8s.io not found in the list of registered CSI drivers
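
As an aside, the drivers kubelet has registered on a node can be listed from its CSINode object; right after a kubelet restart this list can briefly be empty until the plugin re-registers, which matches the error above:

$ kubectl get csinode 127.0.0.1 -o jsonpath='{.spec.drivers[*].name}'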

And then we get to cleanupOrphanedPodDirs:

E0613 22:12:47.038491   81722 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"c17ac5dd-1c61-40db-9e64-6b61103dbd23\" found, but error occurred when trying to remove the volumes dir: not a directory" numErrs=1
I0613 22:12:47.038509   81722 kubelet_volumes.go:247] "Orphan pod" err="orphaned pod \"c17ac5dd-1c61-40db-9e64-6b61103dbd23\" found, but error occurred when trying to remove the volumes dir: not a directory"

So the sequence of events, as I currently understand it, is:
1) kubelet calls ReconstructVolume after restart because getVolumesFromPodDir finds the volume plugin directory.
2) The /mount subdir gets removed, either by cleanupOrphanedPodDirs or by a successful unmount (there is not always a log message for this event).
3) ReconstructVolume reads vol_data.json to construct the volume spec and creates a new mounter object, which writes an update to vol_data.json (saveVolumeData).
4) syncStates fails to reconstruct the volume because the /mount subdir does not exist.
5) The volume is marked as uncertain, and TearDownAt is called in an attempt to unmount it, but this fails because the volume is not in the list of attached volumes.
6) cleanupOrphanedPodDirs attempts to remove the orphaned volumes directory, but fails because vol_data.json still exists: TearDownAt never succeeded, so it never removed the file.

This generally seems to be a lack of coordination between the volume reconstruction and orphaned pod cleanup logic: orphaned pod cleanup depends on reconstruction to remove vol_data.json, but if the /mount subdir is removed first, reconstruction fails and vol_data.json never gets removed.
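
Based on that, pod directories stuck in this state should be identifiable on a node by a leftover vol_data.json whose volume directory has no mount subdir, e.g.:

$ find /var/lib/kubelet/pods/*/volumes -name vol_data.json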

Comment 13 Jonathan Dobson 2022-08-17 18:51:37 UTC
I re-ran the test against the latest Kubernetes master branch today and can no longer reproduce the issue.

root@ubuntu2204:/workspace/csi-driver-host-path# kubectl get nodes
NAME        STATUS   ROLES    AGE   VERSION
127.0.0.1   Ready    <none>   60m   v1.26.0-alpha.0.21+93054957c13a53-dirty

root@ubuntu2204:/workspace/csi-driver-host-path# kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
csi-hostpath-socat-0    1/1     Running   0          56s
csi-hostpathplugin-0    8/8     Running   0          57s
my-csi-app              1/1     Running   0          21s
snapshot-controller-0   1/1     Running   0          61s
root@ubuntu2204:/workspace/csi-driver-host-path# mount | grep csi
/dev/mapper/ubuntu--vg-ubuntu--lv on /var/lib/kubelet/pods/a08c449e-efbe-4f7a-b3bd-4edf377706ff/volumes/kubernetes.io~csi/pvc-e7c4206e-d47d-4e13-9ca8-c1c983bcd806/mount type ext4 (rw,relatime)
root@ubuntu2204:/workspace/csi-driver-host-path# kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM             STORAGECLASS      REASON   AGE
pvc-e7c4206e-d47d-4e13-9ca8-c1c983bcd806   1Gi        RWO            Delete           Bound    default/csi-pvc   csi-hostpath-sc            34s

After killing kubelet, deleting the pod, and starting kubelet back up, the pod gets properly cleaned up with no 'not a directory' errors in kubelet.log.

root@ubuntu2204:/workspace/csi-driver-host-path# kubectl get pods
NAME                    READY   STATUS    RESTARTS      AGE
csi-hostpath-socat-0    1/1     Running   0             4m52s
csi-hostpathplugin-0    8/8     Running   1 (49s ago)   4m53s
snapshot-controller-0   1/1     Running   0             4m57s
root@ubuntu2204:/workspace/csi-driver-host-path# mount | grep csi
root@ubuntu2204:/workspace/csi-driver-host-path# grep 'not a directory' /tmp/kubelet.log
root@ubuntu2204:/workspace/csi-driver-host-path# ls /var/lib/kubelet/pods/a08c449e-efbe-4f7a-b3bd-4edf377706ff/
ls: cannot access '/var/lib/kubelet/pods/a08c449e-efbe-4f7a-b3bd-4edf377706ff/': No such file or directory

Whatever fixed this behavior landed between these two commits:

$ git log 93054957c13a53 | head -9
commit 93054957c13a5398a1b0eabb3d9c77f896e46093
Merge: 619f1cf5520 44a59710ea2
Author: Kubernetes Prow Robot <k8s-ci-robot.github.com>
Date:   Wed Aug 17 09:09:11 2022 -0700

    Merge pull request #111859 from omertuc/fixparallelclosure2
    
    Fix failing `test/e2e/storage/csi_mock_volume.go` test

$ git log b7337cb171ab12 | head -9
commit b7337cb171ab126cb892aacdab2816017a290841
Merge: 4988bfce914 061b8e80494
Author: Kubernetes Prow Robot <k8s-ci-robot.github.com>
Date:   Fri Jun 3 13:32:20 2022 -0700

    Merge pull request #110235 from jsturtevant/fix-new-restricted-pod-tests
    
    test: RunAsUser causes pods to not start on Windows

And we had a few fixes for volume reconstruction merged in Kubernetes 1.25. These are the 2 main PRs that I recall did significant work in this area, and it's likely that one of them already fixed it upstream.

https://github.com/kubernetes/kubernetes/pull/110670
https://github.com/kubernetes/kubernetes/pull/108692

Comment 17 Jonathan Dobson 2022-10-06 20:24:17 UTC
In my last set of experiments, this scenario (i.e. kubelet restart) was fixed on k8s 1.25 / OCP 4.12:

1) create a pod with a single PVC (using the CSI hostpath driver in this case)
2) kill kubelet
3) force delete the pod
4) restart kubelet and wait for the pod to delete
5) check the mounts and the kubelet log

However, if the node is restarted (so the directory is already unmounted when kubelet starts back up), we can still get the 'not a directory' errors described in this bug, along with this 'is not mounted' error:

I0818 18:55:02.696639 1821557 reconciler.go:506] "Could not construct volume information, cleaning up mounts" podName=50c19ad6-0a5d-4684-a02e-86c1365479c9 volumeSpecName="pvc-e7c4206e-d47d-4e13-9ca8-c1c983bcd806" err="volume: \"kubernetes.io/csi/hostpath.csi.k8s.io^ac6d14b9-1e54-11ed-bd38-fabf9fe5c695\" is not mounted"

There is an upstream issue for this part of the problem:
  https://github.com/kubernetes/kubernetes/issues/111933

Comment 18 Jonathan Dobson 2022-10-06 20:31:42 UTC
And this is the commit that seems to have fixed the first part of this issue (kubelet restart):
  https://github.com/kubernetes/kubernetes/commit/daa181d92e85091b1e537d285139168d1ad5f10b
I did not have that fix in my workspace at the time of this test: https://bugzilla.redhat.com/show_bug.cgi?id=2090974#c10
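
For anyone checking a particular release, listing the upstream tags that contain that commit is a quick way to tell whether a given Kubernetes version already has this fix:

$ git tag --contains daa181d92e85091b1e537d285139168d1ad5f10b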

Comment 19 Jonathan Dobson 2022-10-10 15:44:18 UTC
To be clear about the scope of this bug: the symptoms are excessive "orphaned pod" error messages in the kubelet log, and leftover directories under /var/lib/kubelet/pods that can't be cleaned up automatically after kubelet (or node) restarts.
If you're seeing symptoms other than those, they will most likely not be fixed by this bug.
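
A quick way to check a node for exactly those two symptoms (on OCP nodes, kubelet logs to the journal):

$ journalctl -u kubelet | grep -c 'orphaned pod'
$ ls /var/lib/kubelet/pods/    # compare against pod UIDs actually scheduled to this node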

Comment 24 Jonathan Dobson 2023-02-11 00:41:02 UTC
We no longer track bugs in Bugzilla; this bug has been migrated to Jira:
https://issues.redhat.com/browse/OCPBUGS-7345

