Bug 1508578 - excessive amount of 'orphaned path' messages on openshift online nodes.
Summary: excessive amount of 'orphaned path' messages on openshift online nodes.
Keywords:
Status: CLOSED DUPLICATE of bug 1583707
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 3.10.0
Assignee: Hemant Kumar
QA Contact: Chao Yang
URL:
Whiteboard:
Duplicates: 1509960 (view as bug list)
Depends On:
Blocks: 1558600
 
Reported: 2017-11-01 17:34 UTC by Bradley Childs
Modified: 2018-06-07 05:26 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1558600 (view as bug list)
Environment:
Last Closed: 2018-05-29 15:29:23 UTC
Target Upstream Version:
Embargoed:


Attachments:
master logs (9.16 MB, application/x-gzip), 2018-02-23 13:20 UTC, Jan Safranek
node logs (3.80 MB, application/x-gzip), 2018-02-23 13:22 UTC, Jan Safranek
node logs 2 (2.95 MB, application/x-gzip), 2018-02-27 15:27 UTC, Jan Safranek

Description Bradley Childs 2017-11-01 17:34:34 UTC
1. Connect to the openshift online cluster (ssh tower)
2. From there, connect to a node in the cluster ossh -l root "starter-us-west-2-node-compute-53089" 
3. This message shows up over 1k times in the logs: "Nov 01 16:26:22 ip-172-31-52-107.us-west-2.compute.internal atomic-openshift-node[16882]: E1101 16:26:22.301369   16882 kubelet_volumes.go:114] Orphaned pod "80b3aeaf-83c7-11e7-857a-0ab8769191d3" found, but volume paths are still present on disk."

4. You can count the number of occurrences with:
journalctl --since=-60m | grep 80b3aeaf-83c7-11e7-857a-0ab8769191d3 | wc -l

That error message should GO AWAY.  A couple of options:

1) Clean up the mounts on the node (best); see the sketch below for inspecting what is left behind.
2) Clean up or stop the log messages, or stop displaying them. This might mean logging the error only a few times before stopping.
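
As a hedged starting point for option 1, a minimal shell sketch for enumerating the orphan UIDs reported in the journal and inspecting what is still left on disk for them. The pods directory below is the usual OpenShift 3.x node default and is an assumption here; upstream kubelet uses /var/lib/kubelet/pods.

# Sketch only; verify paths on your nodes before acting on the output.
PODS_DIR=/var/lib/origin/openshift.local.volumes/pods

# count occurrences per pod UID in the last hour
journalctl --since=-60m | grep -o 'Orphaned pod "[0-9a-f-]*"' | sort | uniq -c | sort -rn

# list what is left on disk for each reported UID
for uid in $(journalctl --since=-60m | grep -o 'Orphaned pod "[0-9a-f-]*"' | grep -o '[0-9a-f-]\{36\}' | sort -u); do
  echo "== $uid =="
  ls -lR "$PODS_DIR/$uid/volumes/" 2>/dev/null
done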


Comment 1 Bradley Childs 2017-11-06 17:37:18 UTC
*** Bug 1509960 has been marked as a duplicate of this bug. ***

Comment 2 Eric Paris 2017-11-06 19:18:51 UTC
Putting this back to medium. I saw a whole lot of these on nodes that were having various troubles. I'm wondering whether cleaning these orphans up (not just silencing the log message) would reduce some of the load on the node.

Comment 3 mmariyan 2017-11-10 10:44:18 UTC
Hi,

this bug is reported against OCP 3.7 and is scheduled to be fixed in 3.8.
We're currently on 3.6. Once this is fixed, we would like a hotfix for 3.6 as well.

Comment 4 Nicolas Nosenzo 2017-11-27 15:40:02 UTC
Hi,
we are currently hitting a volume detach issue that seems to be caused by:

Nov 27 10:33:19 <node_name> atomic-openshift-node[3948]: E1127 10:33:19.226120    3948 reconciler.go:437] Could not construct volume information: MountVolume.NewMounter failed for volume "kubernetes.io/fc/[]" (spec.Name: "fc-0012") pod "2663ce08-d054-11e7-b7ee-00505684022a" (UID: "2663ce08-d054-11e7-b7ee-00505684022a") with: empty lun
Nov 27 12:19:40 <node_name> atomic-openshift-node[3948]: E1127 12:19:40.082043    3948 kubelet_volumes.go:114] Orphaned pod "2663ce08-d054-11e7-b7ee-00505684022a" found, but volume paths are still present on disk.

Before opening another BZ (and since https://bugzilla.redhat.com/show_bug.cgi?id=1510167 has been marked as a duplicate of this one), I am just wondering if this might be related to the same root cause.


Version:
atomic-openshift-node-3.5.5.31.24-1

Comment 6 Hemant Kumar 2017-11-27 16:31:04 UTC
Please open a separate bug for the Fibre Channel problem. I am sure the volume not attaching correctly for a Fibre Channel volume is unrelated to this bug.

In general, the "Orphaned pod" message appears because when a user shuts down a node, the pods on that node are either migrated or stop running (because the node itself is down). When the node is restarted, some of the directories of those now migrated or no-longer-running pods are still left on disk and do not get cleaned up. At that point it becomes a logging problem on the node (I completely understand it is still a headache) rather than an actual problem that affects the node.
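
To illustrate that distinction between log noise and a real mount leak, a hedged sketch: given an orphaned pod UID from the message, check whether anything under its pod directory is still an active mount point. The path and the example UID are assumptions taken from this bug, not a documented procedure.

UID_TO_CHECK=80b3aeaf-83c7-11e7-857a-0ab8769191d3       # example UID from the description
PODS_DIR=/var/lib/origin/openshift.local.volumes/pods   # assumed OpenShift 3.x default

find "$PODS_DIR/$UID_TO_CHECK/volumes" -mindepth 2 -maxdepth 2 -type d 2>/dev/null |
while read -r dir; do
  if mountpoint -q "$dir"; then
    echo "still mounted:  $dir"    # real leftover mount, needs cleanup
  else
    echo "empty leftover: $dir"    # only logging noise
  fi
done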

Comment 8 Hemant Kumar 2018-01-18 23:02:14 UTC
Yes that is fine.

Comment 11 Hemant Kumar 2018-02-02 21:55:53 UTC
I lgtm-ed the upstream PR and tested various scenarios on EBS, and it indeed fixes our problem.

But the PR also modifies the Map/Unmap functionality introduced by the block device feature, which needs further testing. @mtanino is going to help test that, and then we can merge this.

Comment 12 hchen 2018-02-09 18:15:39 UTC
origin master PR at https://github.com/openshift/origin/pull/18554

Comment 13 Jan Safranek 2018-02-12 16:28:26 UTC
Merged in Origin

Comment 15 Chao Yang 2018-02-23 08:56:16 UTC
This is failing on OCP with the version below:
oc v3.9.0-0.48.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-14-168.ec2.internal:8443
openshift v3.9.0-0.48.0
kubernetes v1.9.1+a0ce1bc657

Verification steps are as below (a consolidated command sketch follows after step 9):
1. oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git
2. Create a dynamic PVC named ebsc
3. oc volume dc/ruby-ex --add --name=v1 -t pvc --claim-name=ebsc
4. After the pod is running, stop the node service
5. oc edit dc ruby-ex
   update the replicas from 1 to 0
6. After the pod is deleted, start the node service
7. After 30 min, check /var/log/messages on the node:
Feb 23 08:49:37 ip-172-18-13-207 atomic-openshift-node: I0223 08:49:37.476633   75718 kubelet_volumes.go:130] Orphan pod: Orphaned pod "0591e66b-1870-11e8-ba71-0e63240230b4" found, but volume paths are still present on disk

8. The EBS volume is still attached to the node.

9. The following can be seen in /var/log/messages on the master:
Feb 23 08:16:30 ip-172-18-14-168 atomic-openshift-master-controllers: I0223 08:16:30.475869   17036 operation_generator.go:410] DetachVolume.Detach succeeded for volume "pvc-d72e54ac-186f-11e8-ba71-0e63240230b4" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0e2db1f14e7423542") on node "ip-172-18-13-207.ec2.internal"
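
For convenience, the steps above as one consolidated shell sketch. The PVC manifest, the gp2 StorageClass name and the atomic-openshift-node unit name are assumptions and may differ in other environments.

oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git

cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebsc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: gp2        # assumed dynamic-provisioning class
EOF

oc volume dc/ruby-ex --add --name=v1 -t pvc --claim-name=ebsc

# once the pod is Running, on the node:
systemctl stop atomic-openshift-node

oc scale dc/ruby-ex --replicas=0    # same effect as editing replicas from 1 to 0
# wait until the pod is gone from the API server, then on the node:
systemctl start atomic-openshift-node

# about 30 minutes later, on the node:
grep 'Orphaned pod' /var/log/messages | tail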

Comment 16 Jan Safranek 2018-02-23 13:20:41 UTC
Created attachment 1399877 [details]
master logs

Comment 17 Jan Safranek 2018-02-23 13:22:08 UTC
Created attachment 1399878 [details]
node logs

Comment 18 Jan Safranek 2018-02-23 13:43:45 UTC
(the logs were trimmed to start at the point where the faulty pod ruby-ex-2-gkz6s (0591e66b-1870-11e8-ba71-0e63240230b4) starts)

Interesting things in the logs:

- master @ 08:08:26: master is unusually quiet. No log message between 08:08:26 and 08:16:19. We don't know what happened for 8 minutes!

- node @ 08:15:38: openshift-node service was restarted

- node @ 08:16:09: first "volume paths are still present on disk" in the node log. Was the pod deleted during restart?

- master @ 08:16:30: master detached the volume. Was it unmounted? There is no TearDown in node logs!

- node @ 09:21:02: node rebooted. After that, kubelet cleaned up its orphaned volumes just fine (they were unmounted by reboot). Were they mounted before? Were they detached and left mounted?!
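
A hedged sketch of how to pull this timeline back out of the attached logs, assuming journald on both hosts and the unit names visible in the log prefixes above; the time window matches the excerpts.

POD_UID=0591e66b-1870-11e8-ba71-0e63240230b4

# node side: service restart, orphan messages and any TearDown for the pod
journalctl -u atomic-openshift-node --since "2018-02-23 08:00" --until "2018-02-23 09:30" | grep -E "Starting OpenShift Node|TearDown|$POD_UID"

# master side: attach/detach activity for the backing volume
journalctl -u atomic-openshift-master-controllers --since "2018-02-23 08:00" --until "2018-02-23 09:30" | grep -E "AttachVolume|DetachVolume|pvc-d72e54ac-186f-11e8-ba71-0e63240230b4"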

Comment 19 Jan Safranek 2018-02-23 16:40:25 UTC
Tried to reproduce several times:

1. Start a pod that uses AWS EBS.
2. Kill -9 openshift node process on the node where the pod runs.
3. Delete the pod. Wait until it's deleted from API server.
4. Restart node service.

Once I could see "Orphan pod: Orphaned pod "28cf4fe3-18ae-11e8-bc08-0e6a1c5923ba" found, but volume paths are still present on disk", but the node fixed it up in ~20 minutes.

Most of the time the volumes are unmounted just fine.
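
The same reproduction as a hedged shell sketch; the pod name and manifest are placeholders, and the process/unit name is the one seen in the logs above.

oc create -f pod-with-ebs-pvc.yaml    # placeholder: any pod using an AWS EBS backed PVC
oc get pod ebs-pod -o wide            # placeholder name; note which node it runs on

# on that node: kill the node process hard (kill -9), not a clean stop
pkill -9 -f atomic-openshift-node

oc delete pod ebs-pod
oc get pod ebs-pod                    # wait until the API server no longer lists it

# back on the node:
systemctl start atomic-openshift-node
journalctl -u atomic-openshift-node -f | grep 'Orphaned pod'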

Comment 20 Jan Safranek 2018-02-26 15:07:47 UTC
From the node log it looks like the VolumeManager reconciler was not started:

- there is no "Reconciler: start to sync state" in the log 
- there is no "Get volumes from pod directory" in the log

Is the log corrupted???

Comment 21 Jan Safranek 2018-02-27 15:27:14 UTC
Created attachment 1401413 [details]
node logs 2

Reproduced for the 2nd time. I stopped the node service and deleted a pod. The pod was in "Terminating" state for a long time, then went to "Unknown". Then I started the node.

Affected pod UID: 23532d92-1bc8-11e8-8b98-0ef0bab22e4c (its volume: pvc-c6758f75-1bc7-11e8-8b98-0ef0bab22e4c)

openshift-node was started at
Feb 27 09:26:29 ip-172-18-6-253 systemd: Starting OpenShift Node...

desired_state_of_world_populator works:
Feb 27 09:26:36 ip-172-18-6-253 atomic-openshift-node: I0227 09:26:35.961020   37193 desired_state_of_world_populator.go:299] Added volume "v1" (volSpec="pvc-c6758f75-1bc7-11e8-8b98-0ef0bab22e4c") for pod "23532d92-1bc8-11e8-8b98-0ef0bab22e4c" to desired state

Reconciler starts:
Feb 27 09:26:36 ip-172-18-6-253 atomic-openshift-node: I0227 09:26:35.961759   37193 reconciler.go:154] Reconciler: start to sync state

kubelet gets notification about the deleted pod:
Feb 27 09:26:43 ip-172-18-6-253 atomic-openshift-node: I0227 09:26:43.279151   37193 kubelet.go:1846] SyncLoop (REMOVE, "api"): "ruby-ex-1-dqzdn_default(23532d92-1bc8-11e8-8b98-0ef0bab22e4c)"

But it does not unmount the volume.

Comment 22 Jan Safranek 2018-03-01 13:06:22 UTC
I think I got the root cause:

0. Pod with EBS volume runs.

1. Node gets down

2. User (or something else) deletes the pod

3. The pod is Terminating; the API server waits until the pod is deleted by the (stopped) kubelet.

4. Node starts

5. desiredStateOfWorldPopulator scans pods and adds this pod to DSW
  - volumehelper.IsPodTerminated says the pod is *not* terminated, because it has running containers.

6. reconciler.syncStates() runs.

  6.1 it finds mounted volume

  6.2 but it does *not* add it to the actual state of world because it's in DSW:
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go#L366

When the kubelet fully starts, it kills the pod and cleans up DSW, but the volume never reached the ASW (actual state of world), so nothing is unmounted.
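
A hedged sketch for confirming the end state this leaves on the node, using the pod from comment 21 as an example; the unit name and the assumption that journald holds the node log are as above.

POD_UID=23532d92-1bc8-11e8-8b98-0ef0bab22e4c   # pod from comment 21

# 1) the volume is still mounted under the pod directory even though the pod is gone from the API
grep "$POD_UID" /proc/mounts

# 2) the populator added the volume to the desired state of world on startup...
journalctl -u atomic-openshift-node | grep "Added volume" | grep "$POD_UID"

# 3) ...but no TearDown/unmount ever happened for it
journalctl -u atomic-openshift-node | grep -i teardown | grep "$POD_UID"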

Comment 23 Jan Safranek 2018-03-01 15:30:17 UTC
I created upstream issue: https://github.com/kubernetes/kubernetes/issues/60645

Investigation will continue there.

Comment 24 Hemant Kumar 2018-03-16 14:55:19 UTC
Is there an ask for fixing this in OCP 3.9? The fix for this has been merged into 1.10, and whenever we rebase 3.10 onto the latest 1.10, the fix will be available.

Eric, what do you think?

Comment 31 Bradley Childs 2018-05-29 15:29:23 UTC
Closing since there's a 3.7 bug for this.

*** This bug has been marked as a duplicate of bug 1583707 ***

