Bug 1705657

Summary: Ephemeral Storage GC in Kubelet has SPOF on Runtime Calls
Product: OpenShift Container Platform Reporter: Steve Kuznetsov <skuznets>
Component: Containers Assignee: Mrunal Patel <mpatel>
Status: CLOSED WONTFIX QA Contact: weiwei jiang <wjiang>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.11.0 CC: aos-bugs, bparees, dwalsh, jokerman, mmccomas, nagrawal, pasik, rkrawitz
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: All   
OS: All   
See Also: https://github.com/kubernetes/kubernetes/issues/42164
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-05-19 18:11:25 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
logs from kubelet and docker during a GC interval that resulted in eviction
none
logs from kubelet and docker during a GC interval that resulted in eviction
none
logs from kubelet and docker during a GC interval that resulted in eviction
none
logs from kubelet and docker during a GC interval that resulted in eviction
none
logs from kubelet and docker during a GC interval that resulted in eviction
none
logs from kubelet and docker during a GC interval that resulted in eviction
none
logs from kubelet and docker during a GC interval that resulted in eviction
none
logs from kubelet and docker during a GC interval that resulted in eviction
none
logs from kubelet and docker during a GC interval that resulted in eviction
none
logs from kubelet and docker during a GC interval that resulted in eviction
none
logs from kubelet and docker during a GC interval that resulted in eviction
none
logs from kubelet and docker during a GC interval that resulted in eviction
none
logs from kubelet and docker during a GC interval that resulted in eviction
none
logs from kubelet and docker during a GC interval that resulted in eviction
none
logs from kubelet and docker during a GC interval that resulted in eviction
none
logs from kubelet and docker during a GC interval that resulted in eviction
none

Description Steve Kuznetsov 2019-05-02 16:57:33 UTC
When the kubelet meets an eviction threshold on ephemeral storage, it kicks off container and image garbage collection. However, several calls to the runtime, such as listing containers or images, must succeed before garbage collection can even begin. When any of these calls fails, the entire garbage collection pass is aborted and the kubelet begins to evict pods to reclaim storage. This is undesirable; it would be much better for the kubelet to retry these calls, since a failure has such high consequences. When the calls do not fail, GC runs correctly and no evictions need to occur. Example logs:

remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
kuberuntime_container.go:329] getKubeletContainers failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
eviction_manager.go:414] eviction manager: unexpected error when attempting to reduce ephemeral-storage pressure: rpc error: code = DeadlineExceeded desc = context deadline exceeded
eviction_manager.go:340] eviction manager: must evict pod(s) to reclaim ephemeral-storage

image_gc_manager.go:181] [imageGCManager] Failed to monitor images: rpc error: code = DeadlineExceeded desc = context deadline exceeded
remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
kuberuntime_container.go:329] getKubeletContainers failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
kubelet.go:1216] Container garbage collection failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
eviction_manager.go:414] eviction manager: unexpected error when attempting to reduce ephemeral-storage pressure: rpc error: code = DeadlineExceeded desc = context deadline exceeded
eviction_manager.go:340] eviction manager: must evict pod(s) to reclaim ephemeral-storage

remote_runtime.go:169] ListPodSandbox with filter nil from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
kuberuntime_sandbox.go:198] ListPodSandbox failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
eviction_manager.go:414] eviction manager: unexpected error when attempting to reduce ephemeral-storage pressure: rpc error: code = DeadlineExceeded desc = context deadline exceeded
eviction_manager.go:340] eviction manager: must evict pod(s) to reclaim ephemeral-storage
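
The control flow visible in these logs can be modeled with a small Go sketch. This is a simplified illustration based only on the description above, not the kubelet's actual code: `reclaimStorage`, `errDeadline`, and the returned strings are hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// errDeadline is a hypothetical stand-in for the
// "context deadline exceeded" error seen in the logs.
var errDeadline = errors.New("context deadline exceeded")

// reclaimStorage models the reported behavior: garbage collection needs a
// successful container listing before it can delete anything, and a single
// listing error skips GC entirely and falls through to eviction.
func reclaimStorage(listContainers func() ([]string, error)) string {
	containers, err := listContainers()
	if err != nil {
		// Single point of failure: one failed call aborts the whole GC pass.
		return "evict pods: " + err.Error()
	}
	return fmt.Sprintf("garbage-collected %d containers", len(containers))
}

func main() {
	healthy := func() ([]string, error) { return []string{"a", "b"}, nil }
	stalled := func() ([]string, error) { return nil, errDeadline }

	fmt.Println(reclaimStorage(healthy))
	fmt.Println(reclaimStorage(stalled))
}
```

In this model, nothing about the node's storage differs between the two calls; only the health of the listing call decides whether pods are evicted.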

Comment 3 Seth Jennings 2019-05-02 18:14:23 UTC
There is nothing the kubelet can do here. It will re-attempt GC later, but retrying individual calls to the CRI is not something the kubelet does, nor is it likely to fix anything.

Sending to Containers.

Comment 4 Steve Kuznetsov 2019-05-03 02:03:14 UTC
Why not? For example, interactions with the GitHub API that are both:

1. known to be faulty with some regularity
2. highly important

are re-tried in Prow (think merges). Not doing a re-try (maybe only for specific errors?) causes evictions. You can never be certain the runtime underneath is not having a hiccup, but nuking pods on the cluster seems like a high-consequence result of a hiccup when all that must happen is a re-try.

Comment 5 Steve Kuznetsov 2019-05-09 17:08:09 UTC
Created attachment 1566281 [details]
logs from kubelet and docker during a GC interval that resulted in eviction

Comment 6 Steve Kuznetsov 2019-05-09 17:08:28 UTC
Created attachment 1566282 [details]
logs from kubelet and docker during a GC interval that resulted in eviction

Comment 7 Steve Kuznetsov 2019-05-09 17:08:47 UTC
Created attachment 1566283 [details]
logs from kubelet and docker during a GC interval that resulted in eviction

Comment 8 Steve Kuznetsov 2019-05-09 17:09:06 UTC
Created attachment 1566284 [details]
logs from kubelet and docker during a GC interval that resulted in eviction

Comment 9 Steve Kuznetsov 2019-05-09 17:09:24 UTC
Created attachment 1566285 [details]
logs from kubelet and docker during a GC interval that resulted in eviction

Comment 10 Steve Kuznetsov 2019-05-09 17:09:44 UTC
Created attachment 1566286 [details]
logs from kubelet and docker during a GC interval that resulted in eviction

Comment 11 Steve Kuznetsov 2019-05-09 17:09:59 UTC
Created attachment 1566287 [details]
logs from kubelet and docker during a GC interval that resulted in eviction

Comment 12 Steve Kuznetsov 2019-05-09 17:10:18 UTC
Created attachment 1566288 [details]
logs from kubelet and docker during a GC interval that resulted in eviction

Comment 13 Steve Kuznetsov 2019-05-09 17:10:34 UTC
Created attachment 1566289 [details]
logs from kubelet and docker during a GC interval that resulted in eviction

Comment 14 Steve Kuznetsov 2019-05-09 17:10:56 UTC
Created attachment 1566290 [details]
logs from kubelet and docker during a GC interval that resulted in eviction

Comment 15 Steve Kuznetsov 2019-05-09 17:11:37 UTC
Created attachment 1566291 [details]
logs from kubelet and docker during a GC interval that resulted in eviction

Comment 16 Steve Kuznetsov 2019-05-09 17:11:55 UTC
Created attachment 1566292 [details]
logs from kubelet and docker during a GC interval that resulted in eviction

Comment 17 Steve Kuznetsov 2019-05-09 17:12:13 UTC
Created attachment 1566294 [details]
logs from kubelet and docker during a GC interval that resulted in eviction

Comment 18 Steve Kuznetsov 2019-05-09 17:12:28 UTC
Created attachment 1566295 [details]
logs from kubelet and docker during a GC interval that resulted in eviction

Comment 19 Steve Kuznetsov 2019-05-09 17:13:01 UTC
Created attachment 1566296 [details]
logs from kubelet and docker during a GC interval that resulted in eviction

Comment 20 Steve Kuznetsov 2019-05-09 17:13:21 UTC
Created attachment 1566297 [details]
logs from kubelet and docker during a GC interval that resulted in eviction