Bug 1510167

Summary: [GSS] Orphaned pod "<<pod_uuid>>" found, but volume paths are still present on disk.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Thom Carlin <tcarlin>
Component: kubernetes
Assignee: Raghavendra Talur <rtalur>
Status: CLOSED WONTFIX
QA Contact: Rachael <rgeorge>
Severity: low
Docs Contact:
Priority: medium
Version: cns-3.6
CC: andcosta, annair, bandrade, bkunal, clichybi, cstark, dmoessne, fgrosjea, hchiramm, ikke, jarrpa, jmulligan, jroberts, knakayam, kramdoss, ksalunkh, ksubrahm, madam, mmariyan, nchilaka, ndevos, pprakash, psony, rcyriac, rekhan, rhs-bugs, rreddy, rtalur, sgaikwad, ssadhale, swachira
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-05-21 14:27:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1558600    
Bug Blocks: 1724792, 1573420, 1622458, 1641915, 1642792    

Description Thom Carlin 2017-11-06 20:26:24 UTC
Description of problem:

While investigating FailedSync errors, I found about 10 "kubelet_volumes.go:114] Orphaned pod "<<pod_uuid>>" found, but volume paths are still present on disk." messages in the output of "systemctl status -l atomic-openshift-node" for CNS 3.6-backed pods.

Version-Release number of selected component (if applicable):

3.6

How reproducible:

Uncertain

Steps to Reproduce: [Uncertain]
1. systemctl status -l atomic-openshift-node

Actual results:

kubelet_volumes.go:114] Orphaned pod "<<pod_uuid>>" found, but volume paths are still present on disk.

Expected results:

No errors
No remnants of pods left

Additional info:

Tentative workaround (for each orphaned pod):
1) Note pod_uuid
2) ls -l -a /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs
  * Note pvc_uuid
3) ls -l -a /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/<<pvc_uuid>>
  * Directory should be empty
4) rmdir /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs
   * Directory should be removed
   * All parent directories up to /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs (inclusive) should also disappear
   * The orphaned-pod message no longer appears for this pod_uuid

Comment 2 Thom Carlin 2017-11-06 20:51:49 UTC
Correction:
4) rmdir /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/<<pvc_uuid>>

Additionally
oc get pvc --all-namespaces | egrep <<pvc_uuid>> should not return anything
oc get pv | egrep <<pvc_uuid>> should not return anything
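
Putting the corrected steps together, here is a minimal shell sketch of the per-pod cleanup (not an official tool). It assumes the standard /var/lib/origin/openshift.local.volumes path, that oc is available and logged in on the node, and that the PVC directory names under kubernetes.io~glusterfs contain the pvc_uuid; the variable names (POD_UUID, VOLDIR) are illustrative only.

# Sketch only -- clean up one orphaned pod directory; run as root on the affected node.
POD_UUID="<<pod_uuid>>"   # the UUID from the orphaned-pod log message
VOLDIR="/var/lib/origin/openshift.local.volumes/pods/${POD_UUID}/volumes/kubernetes.io~glusterfs"

for pvc in "${VOLDIR}"/*; do
    [ -d "${pvc}" ] || continue
    name=$(basename "${pvc}")
    # Only remove the directory if no PVC or PV still references it.
    oc get pvc --all-namespaces | egrep -q "${name}" && { echo "PVC still exists: ${name}"; continue; }
    oc get pv | egrep -q "${name}" && { echo "PV still exists: ${name}"; continue; }
    rmdir "${pvc}"        # rmdir fails safely if the directory is not empty
done
rmdir "${VOLDIR}"         # remove the now-empty kubernetes.io~glusterfs directory

If rmdir complains that a directory is busy or not empty, do not force-remove anything; a stale mount is likely still present and should be investigated first.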

Comment 3 Humble Chirammal 2018-01-11 11:54:49 UTC
We are trying to come up with a common mechanism to resolve stale mounts across different filesystems such as NFS and GlusterFS. The patch is under review; I will keep you posted.

Comment 7 Humble Chirammal 2018-02-05 17:01:41 UTC
This is fixed in OCP 3.9 builds. Moving to ON_QA.

Comment 8 Humble Chirammal 2018-02-07 07:28:08 UTC
I tried to reproduce this issue with the above patches in place, and it is no longer reproducible.

The volume directories are cleaned up after pod deletion, even when the pod launch was unsuccessful.

Comment 13 Rachael 2018-03-08 08:02:11 UTC
How can this issue be reproduced to test the fix?

Comment 14 Humble Chirammal 2018-03-12 16:05:21 UTC
(In reply to Rachael from comment #13)
> How can this issue be reproduced to test the fix?

One verification approach is to go through the logs for `Orphaned pod` messages and check whether the pod UUID matches a pod that used a Gluster PVC claim. This is a generic error message that can appear on a system for any volume type, so we only need to check pods that used a GlusterFS PVC.

Checking the `/var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs` path for pod UUIDs that are no longer present or running on the system can also help verify the bug.
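
A rough shell sketch of that second check, run on a node (assumptions: oc is available and logged in there, and on-disk directory names under the kubelet pods path equal the pod UIDs; PODS_DIR and LIVE_UIDS are just illustrative names):

# Sketch: list on-disk pod directories whose UID no longer belongs to any pod in the cluster.
PODS_DIR=/var/lib/origin/openshift.local.volumes/pods
LIVE_UIDS=$(oc get pods --all-namespaces -o jsonpath='{.items[*].metadata.uid}')

for dir in "${PODS_DIR}"/*; do
    uid=$(basename "${dir}")
    echo "${LIVE_UIDS}" | grep -qw "${uid}" && continue
    # The pod is gone; flag it if a glusterfs volume path is still on disk.
    if [ -d "${dir}/volumes/kubernetes.io~glusterfs" ]; then
        echo "orphaned gluster volume path: ${uid}"
        ls -l -a "${dir}/volumes/kubernetes.io~glusterfs"
    fi
done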

Comment 20 Thom Carlin 2018-05-01 13:37:01 UTC
More information for end-users:

On the OCP node running the pod:

1) df | grep "<<pod_uuid>>"
df: ‘/var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>’: Transport endpoint is not connected

2) ls -l -a /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs
cannot access /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>: Transport endpoint is not connected
total 0
drwxr-x---. 3 root root 54 <<datestamp>> .
drwxr-x---. 5 root root 96 <<datestamp>> ..
d?????????? ? ?    ?     ?            ? pvc-<<pvc_uuid>>
  * Note pvc_uuid

3) ls -l -a /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>
ls: cannot access /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>: Transport endpoint is not connected
  * Directory should be empty

4) rmdir /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>


Additionally
A) oc get pvc --all-namespaces | egrep <<pvc_uuid>> should not return anything
B) oc get pv | egrep <<pvc_uuid>> should not return anything
C) lsof | grep /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>> may help to isolate the root cause
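
For reference, a small sketch that ties 1) and C) together for one pod UUID (the variable names POD_UUID and GLUSTER_DIR are mine, not part of any product tooling):

# Sketch: check one orphaned pod UUID for a stale gluster mount and see what still holds it.
POD_UUID="<<pod_uuid>>"
GLUSTER_DIR="/var/lib/origin/openshift.local.volumes/pods/${POD_UUID}/volumes/kubernetes.io~glusterfs"

# A stale FUSE mount shows up as "Transport endpoint is not connected".
df "${GLUSTER_DIR}"/* 2>&1 | grep "Transport endpoint is not connected"

# Processes that still reference the mount point (step C) point at the glusterfs client to inspect.
lsof 2>/dev/null | grep "${GLUSTER_DIR}"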

Comment 21 Thom Carlin 2018-05-01 13:46:53 UTC
D) Using the PID(s) from C), ps -fp <<glusterfs_pid>>
   * Note the log file path
E) less <<glusterfs_logfile_path>>
   * Note the errors which led to "Transport endpoint is not connected"
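
A short sketch of C) through E) combined, assuming the glusterfs client process shows up in lsof and that its log file path appears on its command line (e.g. via a --log-file argument); MOUNT_PATH and PIDS are illustrative names only:

# Sketch: from the stale mount path, find the glusterfs client PID(s) and inspect the client log.
MOUNT_PATH="/var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>"

# C) PIDs still holding the mount
PIDS=$(lsof 2>/dev/null | grep "${MOUNT_PATH}" | awk '{print $2}' | sort -u)

# D) full command line of each holder; the glusterfs log file path is usually part of it
for pid in ${PIDS}; do
    ps -fp "${pid}"
done

# E) read the log noted above for the errors that preceded "Transport endpoint is not connected"
# less <<glusterfs_logfile_path>>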

Comment 30 Ilkka Tengvall 2018-07-13 06:47:12 UTC
This would need some cleanup script. In my customer's environment there are, for example, 86 of these per node, and that multiplies up quickly. Could we deliver a cleanup script for anyone to run, e.g. from cron or Tower, until this is fixed?

Here's what I used to find these:

export pod=`sudo /bin/journalctl -u atomic-openshift-node --since "1 hour ago" | grep "Orphaned pod" | tail -1 | sed 's/.*Orphaned pod "\([^"]*\)".*/\1/'`; echo $pod

-> Will output:
000d0d5b-47de-11e8-8a1d-001dd8b71e6f

To see the dirs, especially the important volumes/ dir under that:

sudo ls -la /var/lib/origin/openshift.local.volumes/pods/$pod/

Also see if it's mounted: 
df|grep $pod

Now that would need to be enhanced further, but I'm leaving on PTO and don't have the time now. The problem with the above is that it only shows one pod at a time; a slightly extended version is sketched below.
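
A slightly extended sketch that walks every distinct orphaned pod UUID from the journal rather than just the last one (same assumptions as the one-liner above: atomic-openshift-node journal messages and the default openshift.local.volumes path):

# Sketch: list all distinct orphaned pod UUIDs from the last hour and their on-disk/mount state.
sudo /bin/journalctl -u atomic-openshift-node --since "1 hour ago" \
  | grep "Orphaned pod" \
  | sed 's/.*Orphaned pod "\([^"]*\)".*/\1/' \
  | sort -u \
  | while read -r pod; do
        echo "== ${pod}"
        sudo ls -la "/var/lib/origin/openshift.local.volumes/pods/${pod}/" 2>/dev/null
        df 2>/dev/null | grep "${pod}"
    done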

Comment 31 Ilkka Tengvall 2018-07-13 07:45:30 UTC
And here, by the way, is an Ansible ad-hoc command to find the affected nodes in your cluster:

ansible -i ocp/hosts.yml nodes -b -m shell -a '/bin/journalctl -u atomic-openshift-node --since yesterday | /bin/grep "Orphaned pod" | tail -1 ' -f 15

It will output the affected nodes like this:

xxx.yyy.local | SUCCESS | rc=0 >>
Jul 13 10:24:16 xxx.yyy.local atomic-openshift-node[28429]: E0713 10:24:16.909897   28429 kubelet_volumes.go:128] Orphaned pod "000d0d5b-47de-11e8-8a1d-001dd8b71e6f" found, but volume paths are still present on disk : There were a total of 86 errors similar to this. Turn up verbosity to see them.

Comment 34 Michael Adam 2018-09-19 22:07:09 UTC
Is this completely fixed by BZ #1558600 or is there more left?

If this is tracking #1558600, then we should close this as CURRENTRELEASE, since it's already fixed in OCP 3.10.

Comment 35 Michael Adam 2018-09-20 13:44:21 UTC
Even if an additional fix is needed, we cannot fix it in 3.11.0.
==> moving out.

Also adjusting the severity, since this is mostly cosmetic.

Leaving needinfo on Humble to verify whether a fix is needed.

Comment 36 Humble Chirammal 2018-09-20 13:52:37 UTC
(In reply to Michael Adam from comment #34)
> Is this completely fixed by BZ #1558600 or is there more left?
> 
> If this is tracking #1558600, then we should close this as CURRENTRELEASE,
> since it's already fixed in OCP 3.10.

Yes, the OCP bug has been closed with a recent errata:
https://access.redhat.com/errata/RHBA-2018:1816

However, I doubt that all the corner cases are fixed, given the issues I have seen upstream, for example https://github.com/kubernetes/kubernetes/issues/45464


We can retest this with OCP 3.11 and proceed accordingly. That's the best thing I can think of right now.

Comment 49 kedar 2019-06-17 10:47:49 UTC
Hello,

The customer has provided the following details, which were requested and should help with troubleshooting this issue. The details are as follows:

-------------------------------------------------------

=> Which storage are you using for persistent storage in your OCP environment?

We are using Gluster 

=> Are you seeing the same error message on all the nodes?

Yes


=> If you are running OCS in your environment, provide the OCS/Gluster version.

glusterfs-server-3.12.2-18.2.el7rhgs.x86_64

-------------------------------------------------------

I think this information is sufficient to retest on an OCP 3.11 cluster.

Awaiting updates.


Thanks,
Kedar

Comment 55 Red Hat Bugzilla 2024-02-04 04:25:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days