Description of problem:
While investigating FailedSync errors, found about 10 messages of the form

  kubelet_volumes.go:114] Orphaned pod "<<pod_uuid>>" found, but volume paths are still present on disk.

in the output of "systemctl status -l atomic-openshift-node" for CNS 3.6-backed pods.

Version-Release number of selected component (if applicable):
3.6

How reproducible:
Uncertain

Steps to Reproduce:
[Uncertain]
1. systemctl status -l atomic-openshift-node

Actual results:
kubelet_volumes.go:114] Orphaned pod "<<pod_uuid>>" found, but volume paths are still present on disk.

Expected results:
No errors. No remnants of pods left on disk.

Additional info:
Tentative workaround (for each orphaned pod):
1) Note the pod_uuid.
2) ls -l -a /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs
   * Note the pvc_uuid.
3) ls -l -a /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/<<pvc_uuid>>
   * The directory should be empty.
4) rmdir /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs
   * The directory should be removed.
   * All parent directories up to /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs (inclusive) should also disappear.
   * The orphan message no longer appears for this pod_uuid.
Correction:
4) rmdir /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/<<pvc_uuid>>

Additionally:
  oc get pvc --all-namespaces | egrep <<pvc_uuid>>   (should not return anything)
  oc get pv | egrep <<pvc_uuid>>                     (should not return anything)
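The corrected workaround steps (2 through 4 above) can be sketched as a small script. This is a minimal sketch, not an official tool: the function name is made up, BASE is parameterized only so the logic can be exercised outside a real node, and rmdir is used deliberately so that a PVC directory that still contains data is never touched.

```shell
#!/bin/sh
# Sketch of the corrected cleanup workaround (steps 2-4 above).
# BASE defaults to the standard origin volumes path; it is a variable
# here only so the logic can be tested against a scratch directory.
BASE="${BASE:-/var/lib/origin/openshift.local.volumes/pods}"

cleanup_orphaned_pod() {
    pod_uuid="$1"
    glusterdir="$BASE/$pod_uuid/volumes/kubernetes.io~glusterfs"
    [ -d "$glusterdir" ] || { echo "no glusterfs dir for $pod_uuid"; return 1; }
    for pvcdir in "$glusterdir"/*; do
        [ -d "$pvcdir" ] || continue
        # rmdir only succeeds on empty directories, so a PVC dir
        # that still holds data is left alone.
        rmdir "$pvcdir" && echo "removed $pvcdir"
    done
    # Remove the now-empty parents up to the pod directory.
    rmdir "$glusterdir" "$BASE/$pod_uuid/volumes" "$BASE/$pod_uuid" 2>/dev/null
}
```

As in the comment above, `oc get pvc --all-namespaces | egrep <<pvc_uuid>>` and `oc get pv | egrep <<pvc_uuid>>` should be checked first to confirm the PVC/PV really is gone before removing anything.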
We are trying to come up with a common mechanism to resolve stale mounts across different filesystems such as NFS and GlusterFS. I will keep you posted. The patch is in review.
This is fixed in OCP 3.9 builds. Moving to ON_QA.
I have tried to reproduce this issue with the above patches in place, and it is not reproducible. The volumes are gone after pod deletion, even in the case of an unsuccessful pod launch.
How can this issue be reproduced to test the fix?
(In reply to Rachael from comment #13)
> How can this issue be reproduced to test the fix?

One verification approach is to go through the logs for `orphaned pod` messages and check whether the pod UUID matches a pod that used a Gluster PVC claim. This is a generic error message and can appear for any volume type, so we specifically need to check pods that used a GlusterFS PVC.

Also, looking at the /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs path for pod UUIDs that are no longer in the system, or not running, can help us verify the bug.
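The on-disk check above can be sketched as a small helper: list pod directories that still have a kubernetes.io~glusterfs subdirectory but whose UID is not among the live pods. The function name is illustrative, and how the live-UID list is produced (e.g. via `oc get pods --all-namespaces -o jsonpath='{.items[*].metadata.uid}'`) is an assumption; here it is just a file with one UID per line.

```shell
# Print pod UUIDs under the kubelet volumes path that still have a
# kubernetes.io~glusterfs directory but are not in the live-UID list.
find_stale_gluster_dirs() {
    base="$1"   # e.g. /var/lib/origin/openshift.local.volumes/pods
    live="$2"   # file containing live pod UIDs, one per line
    for dir in "$base"/*/volumes/kubernetes.io~glusterfs; do
        [ -d "$dir" ] || continue
        # dir is <base>/<uid>/volumes/kubernetes.io~glusterfs; peel
        # off two path components to recover the pod UID.
        uid=$(basename "$(dirname "$(dirname "$dir")")")
        grep -qx "$uid" "$live" || echo "$uid"
    done
}
```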
More information for end-users:

On the OCP node running the pod:

1) df | grep "<<pod_uuid>>"
   df: ‘/var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>’: Transport endpoint is not connected

2) ls -l -a /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs
   ls: cannot access /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>: Transport endpoint is not connected
   total 0
   drwxr-x---. 3 root root 54 <<datestamp>> .
   drwxr-x---. 5 root root 96 <<datestamp>> ..
   d?????????? ? ?    ?     ?            ? pvc-<<pvc_uuid>>
   * Note the pvc_uuid.

3) ls -l -a /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/<<pvc_uuid>>
   ls: cannot access /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>: Transport endpoint is not connected
   * The directory should be empty.

4) rmdir /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>

Additionally:
A) oc get pvc --all-namespaces | egrep <<pvc_uuid>>   (should not return anything)
B) oc get pv | egrep <<pvc_uuid>>                     (should not return anything)
C) lsof | grep /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>   (may help to isolate the root cause)
D) Using the PID(s) from C): ps -fp <<glusterfs_pid>>
   * Note the log file path.
E) less <<glusterfs_logfile_path>>
   * Note the errors which led to "Transport endpoint is not connected".
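The "Transport endpoint is not connected" state in step 1) can also be detected programmatically, since df prints that message for a broken FUSE mount. A minimal sketch; both function names are made up, and the pattern check is split out only so it can be tested against canned text:

```shell
# Return 0 if the given df/ls output mentions a disconnected FUSE mount.
check_disconnected() {
    grep -q 'Transport endpoint is not connected'
}

# Return 0 if df reports a disconnected transport for the given path
# (the message may appear on stderr, hence 2>&1).
is_disconnected_mount() {
    df "$1" 2>&1 | check_disconnected
}
```

E.g. `is_disconnected_mount /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>> && echo stale`.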
This would need some cleanup script. In my customer's env there are e.g. 86 of these per node, and that multiplies to quite many quickly. Could we deliver a cleanup script for anyone to run, e.g. from cron/Tower, until this is fixed?

Here's what I used to find these:

export pod=$(sudo /bin/journalctl -u atomic-openshift-node --since "1 hour ago" | grep "Orphaned pod" | tail -1 | sed -E 's/.*Orphaned pod "([^"]+)".*/\1/'); echo $pod

-> Will output:
000d0d5b-47de-11e8-8a1d-001dd8b71e6f

To see the dirs, especially the important volumes/ dir under that:

sudo ls -la /var/lib/origin/openshift.local.volumes/pods/$pod/

Also see if it's mounted:

df | grep $pod

Now that would need to be enhanced further, but I'm leaving on PTO and don't have the time now. The problem with the above is that it only shows one pod at a time.
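The "only one at a time" limitation comes from the tail -1; every distinct orphaned pod UUID can be pulled out of the journal output instead. A sketch (the function name is made up; the journalctl invocation is the one from the comment above, and the extraction itself is what is tested here):

```shell
# Extract every distinct orphaned pod UUID from kubelet log lines.
# Typical use (as in the comment above):
#   sudo /bin/journalctl -u atomic-openshift-node --since "1 hour ago" \
#       | extract_orphans
extract_orphans() {
    grep -o 'Orphaned pod "[^"]*"' | sed 's/Orphaned pod "\(.*\)"/\1/' | sort -u
}
```

The resulting list could then be fed to a per-pod cleanup step, e.g. from cron/Tower.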
And here's, btw, an ansible command to find the affected nodes in your cluster:

ansible -i ocp/hosts.yml nodes -b -m shell -a '/bin/journalctl -u atomic-openshift-node --since yesterday | /bin/grep "Orphaned pod" | tail -1' -f 15

It will output the affected nodes like this:

xxx.yyy.local | SUCCESS | rc=0 >>
Jul 13 10:24:16 xxx.yyy.local atomic-openshift-node[28429]: E0713 10:24:16.909897 28429 kubelet_volumes.go:128] Orphaned pod "000d0d5b-47de-11e8-8a1d-001dd8b71e6f" found, but volume paths are still present on disk : There were a total of 86 errors similar to this. Turn up verbosity to see them.
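Since the kubelet suppresses similar errors and reports only a total ("There were a total of 86 errors similar to this."), the per-node count can be pulled out of that last line. A small sketch (the function name is illustrative; it is tested against the sample line above):

```shell
# Print the suppressed-error count from a kubelet "Orphaned pod" line,
# i.e. the N in "There were a total of N errors similar to this."
orphan_count() {
    sed -n 's/.*a total of \([0-9][0-9]*\) errors.*/\1/p'
}
```

Piping each node's last "Orphaned pod" line through this gives a quick per-node severity figure without turning up verbosity.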
Is this completely fixed by BZ #1558600 or is there more left? If this is tracking #1558600, then we should close this as CURRENTRELEASE, since it's already fixed in OCP 3.10.
Even if an additional fix is needed, we cannot fix it in 3.11.0. ==> Moving out, and adapting severity, since this is mostly cosmetic. Leaving needinfo on Humble to verify whether a fix is needed.
(In reply to Michael Adam from comment #34)
> Is this completely fixed by BZ #1558600 or is there more left?
>
> If this is tracking #1558600, then we should close this as CURRENTRELEASE,
> since it's already fixed in OCP 3.10.

Yes, the OCP bug has been closed with the recent errata: https://access.redhat.com/errata/RHBA-2018:1816

However, I doubt that all the corner cases are fixed, given the issues I have seen upstream, e.g. https://github.com/kubernetes/kubernetes/issues/45464. We can retest this with OCP 3.11 and proceed accordingly. That's the best thing I can think of now.
Hello,

The customer has provided the following details, which were requested and could help in proceeding with the troubleshooting of this issue:

-------------------------------------------------------
=> Which storage are you using for persistent storage in your OCP environment?
We are using Gluster.

=> Are you seeing the same error message on all of your nodes?
Yes.

=> If you are running OCS in your environment, get the OCS/gluster version.
glusterfs-server-3.12.2-18.2.el7rhgs.x86_64
-------------------------------------------------------

I guess this information is sufficient to retest on the OCP 3.11 cluster. Awaiting updates.

Thanks,
Kedar
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days