1. Connect to the openshift online cluster (ssh tower)
2. From there, connect to a node in the cluster ossh -l root "starter-us-west-2-node-compute-53089"
3. This message shows up over 1k times in the logs: "Nov 01 16:26:22 ip-172-31-52-107.us-west-2.compute.internal atomic-openshift-node: E1101 16:26:22.301369 16882 kubelet_volumes.go:114] Orphaned pod "80b3aeaf-83c7-11e7-857a-0ab8769191d3" found, but volume paths are still present on disk."
4. you can count the number of occurrences with:
journalctl --since=-60m | grep 80b3aeaf-83c7-11e7-857a-0ab8769191d3 | wc -l
That error message should GO AWAY. A couple options:
1) Cleanup the mounts on the node *** (best)
2) Cleanup/ stop the log messages, or stop displaying them. This might include only logging the error a few times before stopping.
Description of problem:
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Node Log (of failed PODs):
StorageClass Dump (if StorageClass used by PV/PVC):
*** Bug 1509960 has been marked as a duplicate of this bug. ***
Putting this back to medium. I saw a whole lot of these on nodes that were having differing troubles. I'm wondering if cleaning these orphans up (not just the log message) won't reduce some of the load on the node.
this bug is reported on OCP 3.7, and scheduled to be fixed in 3.8.
We're currently on 3.6 .Once fixed we would like a hotfix for 3.6 on this.
currently hitting a volume detach issue, that seems to be caused by:
Nov 27 10:33:19 <node_name> atomic-openshift-node: E1127 10:33:19.226120 3948 reconciler.go:437] Could not construct volume information: MountVolume.NewMounter failed for volume "kubernetes.io/fc/" (spec.Name: "fc-0012") pod "2663ce08-d054-11e7-b7ee-00505684022a" (UID: "2663ce08-d054-11e7-b7ee-00505684022a") with: empty lun
Nov 27 12:19:40 <node_name> atomic-openshift-node: E1127 12:19:40.082043 3948 kubelet_volumes.go:114] Orphaned pod "2663ce08-d054-11e7-b7ee-00505684022a" found, but volume paths are still present on disk.
Before opening another BZ (and since https://bugzilla.redhat.com/show_bug.cgi?id=1510167 has been marked as duplicate of this one), Just wondering if this might be related to same root cause.
Please open separate bug for fiber channel problem. I am sure volume not attaching correctly for fiber channel volume is unrelated to this bug.
The reason Orphaned pod message appears in general is because - when a user shuts down the node and pods from node are either migrated or they stop running(because node itself is down) and then when node is restarted, some of those now migrated or not running pod directories are still left on the disk and they don't get cleaned up. At this point, though it becomes a logging problem on the node(I completely understand it is still a headache) rather than actual problem that affects the node.
Yes that is fine.
I lgtmed upstream PR and tested with various scenarios on EBS and it indeed fixes our problem.
But the PR also modifies Map/Unmap feature introduced in block device feature, that needs further testing. @mtanino is going to help test that and then we can merge this.
origin master PR at https://github.com/openshift/origin/pull/18554
Merged in Origin
It is failed on ocp based on the version of below:
features: Basic-Auth GSSAPI Kerberos SPNEGO
Verification steps are as below:
1.oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git
2.create a dynamic pvc ebsc
3.oc volume dc/ruby-ex --add --name=v1 -t pvc --claim-name=ebsc
4.After pod is running, stop node service
5.oc edit dc ruby-ex
update the replicas from 1 to 0
6.After pod is deleted, start node service
7. After 30 min, check /var/log/message on the node
Feb 23 08:49:37 ip-172-18-13-207 atomic-openshift-node: I0223 08:49:37.476633 75718 kubelet_volumes.go:130] Orphan pod: Orphaned pod "0591e66b-1870-11e8-ba71-0e63240230b4" found, but volume paths are still present on disk
8.ebs voume still attached to the node
9.Can see the log from /var/log/message on master that
Feb 23 08:16:30 ip-172-18-14-168 atomic-openshift-master-controllers: I0223 08:16:30.475869 17036 operation_generator.go:410] DetachVolume.Detach succeeded for volume "pvc-d72e54ac-186f-11e8-ba71-0e63240230b4" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0e2db1f14e7423542") on node "ip-172-18-13-207.ec2.internal"
Created attachment 1399877 [details]
Created attachment 1399878 [details]
(logs were cut when the faulty pod ruby-ex-2-gkz6s (0591e66b-1870-11e8-ba71-0e63240230b4) starts)
Interesting things in the logs:
- master @ 08:08:26: master is unusually quiet. No log message between 08:08:26 and 08:16:19. We don't know what happened for 8 minutes!
- node @ 08:15:38: openshift-node service was restarted
- node @ 08:16:09: first "volume paths are still present on disk" in the node log. Was the pod deleted during restart?
- master @ 08:16:30: master detached the volume. Was it unmounted? There is no TearDown in node logs!
- node @ 09:21:02: node rebooted. After that, kubelet cleaned up its orphaned volumes just fine (they were unmounted by reboot). Were they mounted before? Were they detached and left mounted?!
Tried to reproduce several times:
1. Start a pod that uses AWS EBS.
2. Kill -9 openshift node process on the node where the pod runs.
3. Delete the pod. Wait until it's deleted from API server.
4. Restart node service.
Once I could see "Orphan pod: Orphaned pod "28cf4fe3-18ae-11e8-bc08-0e6a1c5923ba" found, but volume paths are still present on disk", but node fixes it in ~20 minutes.
Most of the time the volumes are unmounted just fine.
From the node log it looks like no VolumeManager reconciler was not started:
- there is no "Reconciler: start to sync state" in the log
- there is no "Get volumes from pod directory" in the log
Is the log corrupted???
Created attachment 1401413 [details]
node logs 2
Reproduced for 2nd time. I stopped node service, & deleted a pod. The pod was in "Terminating" state for a long time, then went to "Unknown". Then I started the node.
Affected PV: 23532d92-1bc8-11e8-8b98-0ef0bab22e4c
openshift-node was started at
Feb 27 09:26:29 ip-172-18-6-253 systemd: Starting OpenShift Node...
desired_state_of_world_populator works :
Feb 27 09:26:36 ip-172-18-6-253 atomic-openshift-node: I0227 09:26:35.961020 37193 desired_state_of_world_populator.go:299] Added volume "v1" (volSpec="pvc-c6758f75-1bc7-11e8-8b98-0ef0bab22e4c") for pod "23532d92-1bc8-11e8-8b98-0ef0bab22e4c" to desired state
Feb 27 09:26:36 ip-172-18-6-253 atomic-openshift-node: I0227 09:26:35.961759 37193 reconciler.go:154] Reconciler: start to sync state
kubelet gets notification about the deleted pod:
Feb 27 09:26:43 ip-172-18-6-253 atomic-openshift-node: I0227 09:26:43.279151 37193 kubelet.go:1846] SyncLoop (REMOVE, "api"): "ruby-ex-1-dqzdn_default(23532d92-1bc8-11e8-8b98-0ef0bab22e4c)"
But it does not unmount the volume.
I think I got the root cause:
0. Pod with EBS volume runs.
1. Node gets down
2. User (or something else) deletes the pod
3. The pod is Terminating, API server waits until the pod is deleted by (stopped) kubelet.
4. Node starts
5. desiredStateOfWorldPopulator scans pods and adds this pod to DSW
- volumehelper.IsPodTerminated says the pod is *not* terminated, because it has running contatiners.
6. reconciler.syncStates() runs.
6.1 it finds mounted volume
6.2 but it does *not* add it to actual state of world because it's in DSW:
When kubelet fully starts, it kills the pod and cleans DSW, but the volume never reached AWS so nothing is unmounted.
I created upstream issue: https://github.com/kubernetes/kubernetes/issues/60645
Investigation will continue there.
Is there an ask for fixing this in OCP-3.9? The fix for this has been merged 1.10 and whenever we rebase 3.10 with latest 1.10, the fix will be available.
Eric what do you think?
Closing since there's a 3.7 bug
*** This bug has been marked as a duplicate of bug 1583707 ***