Bug 1510167 - [GSS] Orphaned pod "<<pod_uuid>>" found, but volume paths are still present on disk.
Summary: [GSS] Orphaned pod "<<pod_uuid>>" found, but volume paths are still present on disk.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: kubernetes
Version: cns-3.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Assignee: Raghavendra Talur
QA Contact: Rachael
URL:
Whiteboard:
Depends On: 1558600
Blocks: 1724792 1573420 1622458 OCS-3.11.1-devel-triage-done 1642792
 
Reported: 2017-11-06 20:26 UTC by Thom Carlin
Modified: 2024-02-04 04:25 UTC
CC List: 31 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-21 14:27:27 UTC
Embargoed:




Links
Github kubernetes/kubernetes issue 60645 (closed): Kubelet leaves orphaned directories and mounts after downtime. Last Updated: 2021-02-08 05:27:13 UTC

Description Thom Carlin 2017-11-06 20:26:24 UTC
Description of problem:

While investigating FailedSync errors, I found about 10 "kubelet_volumes.go:114] Orphaned pod "<<pod_uuid>>" found, but volume paths are still present on disk." messages in the output of "systemctl status -l atomic-openshift-node" for CNS 3.6-backed pods.

Version-Release number of selected component (if applicable):

3.6

How reproducible:

Uncertain

Steps to Reproduce: [Uncertain]
1. systemctl status -l atomic-openshift-node

Actual results:

kubelet_volumes.go:114] Orphaned pod "<<pod_uuid>>" found, but volume paths are still present on disk.

Expected results:

No errors
No remnants of pods left

Additional info:

Tentative workaround (for each orphaned pod):
1) Note pod_uuid
2) ls -l -a /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs
  * Note pvc_uuid
3) ls -l -a /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/<<pvc_uuid>>
  * Directory should be empty
4) rmdir /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs
   * Directory should be removed
   * All parent directories up to /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs (inclusive) should also disappear
   * Orphan message for this pod no longer appears for this pod_uuid

Comment 2 Thom Carlin 2017-11-06 20:51:49 UTC
Correction:
4) rmdir /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/<<pvc_uuid>>

Additionally
oc get pvc --all-namespaces | egrep <<pvc_uuid>> should not return anything
oc get pv | egrep <<pvc_uuid>> should not return anything
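
For illustration, a minimal shell sketch of steps 1-4 with the corrected rmdir path, plus the PVC/PV checks above. The POD_UUID argument and the skip-if-still-referenced guard are additions for this sketch; it is a manual aid under those assumptions, not a supported fix:

#!/bin/bash
# Sketch: clean up one orphaned pod's empty glusterfs volume directories.
# Usage: ./cleanup-orphan.sh <pod_uuid>   (pod_uuid taken from the kubelet log)
POD_UUID="$1"
BASE="/var/lib/origin/openshift.local.volumes/pods/${POD_UUID}/volumes/kubernetes.io~glusterfs"

# Steps 2-3: list the glusterfs volume directory and its pvc subdirectories.
ls -l -a "${BASE}"

for PVC_DIR in "${BASE}"/*/; do
    [ -d "${PVC_DIR}" ] || continue
    PVC_UUID="$(basename "${PVC_DIR}")"

    # Safety check from this comment: the PVC/PV must no longer exist.
    if oc get pvc --all-namespaces | grep -q "${PVC_UUID}" || oc get pv | grep -q "${PVC_UUID}"; then
        echo "skipping ${PVC_UUID}: still referenced by a PVC or PV" >&2
        continue
    fi

    # Step 4 (corrected): rmdir only succeeds if the directory is empty.
    rmdir "${PVC_DIR}"
done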

Comment 3 Humble Chirammal 2018-01-11 11:54:49 UTC
We are trying to come up with a common mechanism to resolve stale mounts across different filesystems such as NFS and GlusterFS. The patch is in review; I will keep you posted.

Comment 7 Humble Chirammal 2018-02-05 17:01:41 UTC
This is fixed in OCP 3.9 builds. Moving to ON_QA.

Comment 8 Humble Chirammal 2018-02-07 07:28:08 UTC
I have tried to reproduce this issue with the above patches in place, and it is no longer reproducible.

The volumes are gone after pod deletion, even in the case of an unsuccessful pod launch.

Comment 13 Rachael 2018-03-08 08:02:11 UTC
How can this issue be reproduced to test the fix?

Comment 14 Humble Chirammal 2018-03-12 16:05:21 UTC
(In reply to Rachael from comment #13)
> How can this issue be reproduced to test the fix?

One verification model is to go through the logs for `orphaned pod` messages and check whether the pod UUID matches a pod that used a gluster PVC. This is a generic error message and can appear on a system for any volume type, so we only need to check pods that used a glusterfs PVC.

Also, looking at the `ls -l -a /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs` path for pod UUIDs that are no longer present or running on the system can help us verify the bug.
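
A shell sketch of this verification model, assuming the journal unit and paths from the earlier comments; the jsonpath-based UID check is an addition here and only illustrates the idea:

#!/bin/bash
# Sketch: list orphaned pod UUIDs from the node journal and flag the ones
# that still have a glusterfs volume path but no longer exist in the cluster.
journalctl -u atomic-openshift-node --since yesterday \
  | grep -o 'Orphaned pod "[^"]*"' \
  | tr -d '"' | awk '{print $3}' | sort -u \
  | while read -r POD_UUID; do
      VOLDIR="/var/lib/origin/openshift.local.volumes/pods/${POD_UUID}/volumes/kubernetes.io~glusterfs"
      # Only pods that used a glusterfs PVC are relevant to this bug.
      [ -d "${VOLDIR}" ] || continue
      # Flag the pod if no object with this UID is known to the cluster.
      if ! oc get pods --all-namespaces -o jsonpath='{.items[*].metadata.uid}' | grep -q "${POD_UUID}"; then
          echo "orphaned glusterfs pod: ${POD_UUID}"
          ls -l -a "${VOLDIR}"
      fi
  done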

Comment 20 Thom Carlin 2018-05-01 13:37:01 UTC
More information for end-users:

On OCP node running the pod:

1) df | grep "<<pod_uuid>>"
df: ‘/var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>’: Transport endpoint is not connected

2) ls -l -a /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs
cannot access /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>: Transport endpoint is not connected
Total 0
drwxr-x---. 3 root root 54 <<datestamp>> .
drwxr-x---. 5 root root 96 <<datestamp>> ..
d?????????? ? ?    ?     ?            ? pvc-<<pvc_uuid>>
  * Note pvc_uuid

3) ls -l -a /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>
ls: cannot access /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>: Transport endpoint is not connected
  * Directory should be empty

4) rmdir /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>


Additionally
A) oc get pvc --all-namespaces | egrep <<pvc_uuid>> should not return anything
B) oc get pv | egrep <<pvc_uuid>> should not return anything
C) lsof | grep /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>> may help to isolate the root cause

Comment 21 Thom Carlin 2018-05-01 13:46:53 UTC
D) Using the PID(s) from C), ps -fp <<glusterfs_pid>>
   * Note the log file path
E) less <<glusterfs_logfile_path>>
   * Note the errors which led to "Transport endpoint is not connected"
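
Taken together, steps 1-4 and A-E can be scripted roughly as follows. POD_UUID and PVC_UUID are placeholders the operator supplies, and the awk-based PID extraction is an assumption layered on the lsof output above; treat this as a diagnostic sketch only:

#!/bin/bash
# Sketch: diagnose a stale glusterfs mount ("Transport endpoint is not connected")
# for one orphaned pod on the OCP node that hosted it.
POD_UUID="$1"
PVC_UUID="$2"
MOUNT="/var/lib/origin/openshift.local.volumes/pods/${POD_UUID}/volumes/kubernetes.io~glusterfs/pvc-${PVC_UUID}"

# Steps 1-3: a dead mount shows up as "Transport endpoint is not connected".
df | grep "${POD_UUID}"
ls -l -a "$(dirname "${MOUNT}")"

# Steps A-B: the PVC and PV should no longer exist in the cluster.
oc get pvc --all-namespaces | grep "${PVC_UUID}"
oc get pv | grep "${PVC_UUID}"

# Steps C-D: find processes still referencing the mount point and show their
# command lines; the glusterfs client command line includes its log file path.
for PID in $(lsof 2>/dev/null | grep "${MOUNT}" | awk '{print $2}' | sort -u); do
    ps -fp "${PID}"
done
# Step E: inspect the glusterfs log file noted above for the disconnect errors.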

Comment 30 Ilkka Tengvall 2018-07-13 06:47:12 UTC
This needs a cleanup script. In my customer's environment there are, for example, 86 of these per node, and that multiplies quickly. Could we deliver a cleanup script for anyone to run, e.g. from cron/Tower, until this is fixed?

Here's what I used to find these:

export pod=`sudo /bin/journalctl -u atomic-openshift-node --since "1 hour ago" | grep "Orphaned pod" | tail -1 | sed 's/.*Orphaned pod "\([^"]*\)".*/\1/'`; echo $pod

-> Will output:
000d0d5b-47de-11e8-8a1d-001dd8b71e6f

To see the dirs, especially the important volumes/ dir under that:

sudo ls -la /var/lib/origin/openshift.local.volumes/pods/$pod/

Also see if it's mounted: 
df|grep $pod

This would need to be enhanced further, but I'm leaving on PTO and don't have the time now. The problem with the above is that it only shows one pod at a time.
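
A sketch of the kind of per-node cleanup loop being asked for here, assembled only from the commands already shown in this bug (journalctl, df, rmdir). It skips anything still mounted and relies on rmdir refusing to remove non-empty directories; it is illustrative, not a delivered or supported script:

#!/bin/bash
# Sketch: iterate over all orphaned pod UUIDs reported in the last hour and
# remove only empty glusterfs volume directories for pods with no active mount.
journalctl -u atomic-openshift-node --since "1 hour ago" \
  | grep -o 'Orphaned pod "[^"]*"' | tr -d '"' | awk '{print $3}' | sort -u \
  | while read -r pod; do
      voldir="/var/lib/origin/openshift.local.volumes/pods/${pod}/volumes/kubernetes.io~glusterfs"
      [ -d "${voldir}" ] || continue
      # Skip pods that still show an active mount under their pod directory.
      if df 2>/dev/null | grep -q "${pod}"; then
          echo "${pod}: still mounted, skipping" >&2
          continue
      fi
      # rmdir fails harmlessly on non-empty directories, so data is never removed.
      find "${voldir}" -mindepth 1 -maxdepth 1 -type d -print -exec rmdir {} \; 2>/dev/null
  done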

Comment 31 Ilkka Tengvall 2018-07-13 07:45:30 UTC
And here, by the way, is an Ansible command to find the affected nodes in your cluster:

ansible -i ocp/hosts.yml nodes -b -m shell -a '/bin/journalctl -u atomic-openshift-node --since yesterday | /bin/grep "Orphaned pod" | tail -1 ' -f 15

It will output the affected nodes like this:

xxx.yyy.local | SUCCESS | rc=0 >>
Jul 13 10:24:16 xxx.yyy.local atomic-openshift-node[28429]: E0713 10:24:16.909897   28429 kubelet_volumes.go:128] Orphaned pod "000d0d5b-47de-11e8-8a1d-001dd8b71e6f" found, but volume paths are still present on disk : There were a total of 86 errors similar to this. Turn up verbosity to see them.

Comment 34 Michael Adam 2018-09-19 22:07:09 UTC
Is this completely fixed by BZ #1558600 or is there more left?

If this is tracking #1558600, then we should close this as CURRENTRELEASE, since it's already fixed in OCP 3.10.

Comment 35 Michael Adam 2018-09-20 13:44:21 UTC
Even if an additional fix is needed, we cannot fix it in 3.11.0.
==> moving out.

And adapting severity, since this is mostly cosmetic.

Leaving needinfo on Humble to verify whether a fix is needed.

Comment 36 Humble Chirammal 2018-09-20 13:52:37 UTC
(In reply to Michael Adam from comment #34)
> Is this completely fixed by BZ #1558600 or is there more left?
> 
> If this is tracking #1558600, then we should close this as CURRENTRELEASE,
> since it's already fixed in OCP 3.10.

Yes, the OCP bug has been closed with a recent erratum:
https://access.redhat.com/errata/RHBA-2018:1816

However, I doubt that all the corner cases are fixed, given the issues I have seen upstream, e.g. https://github.com/kubernetes/kubernetes/issues/45464


We can retest this with OCP 3.11 and proceed accordingly. That's the best thing I can think of right now.

Comment 49 kedar 2019-06-17 10:47:49 UTC
Hello,

The customer has provided the following details, which were requested and should help us proceed with troubleshooting this issue. The details are as follows:

-------------------------------------------------------

=> Which storage are you using for persistent storage in your OCP environment?

We are using Gluster 

=> Are you seeing the same error message on all of your nodes?

Yes


=> If you are running OCS in your environment, provide the OCS/gluster version.

glusterfs-server-3.12.2-18.2.el7rhgs.x86_64

-------------------------------------------------------

I guess this information is sufficient to retest on the OCP 3.11 cluster.

Awaiting updates.


Thanks,
Kedar

Comment 55 Red Hat Bugzilla 2024-02-04 04:25:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

