Bug 1759169 - [4.3] oauth-proxy container OOM killed
Summary: [4.3] oauth-proxy container OOM killed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: apiserver-auth
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.3.0
Assignee: Standa Laznicka
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks: 1757314 1762748
Reported: 2019-10-07 14:20 UTC by Standa Laznicka
Modified: 2020-05-13 21:27 UTC (History)
7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: oauth-proxy reloaded the CA certificates for each request and stored them in memory. Consequence: the resulting high memory consumption caused the oauth-proxy container to be OOM killed. Fix: the CA certificates are now cached unless they change. Result: memory consumption of the oauth-proxy process dropped significantly when multiple requests were issued against it.
Clone Of: 1757314
Clones: 1762748
Environment:
Last Closed: 2020-05-13 21:27:06 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift oauth-proxy pull 137 'None' closed Bug 1759169: Fix high memory usage 2020-11-27 09:30:02 UTC
Red Hat Product Errata RHBA-2020:0062 None None None 2020-05-13 21:27:11 UTC

Description Standa Laznicka 2019-10-07 14:20:22 UTC
This bug was initially created as a copy of Bug #1757314

I am copying this bug because: reasons



Description of problem:
The kibana-proxy container in the kibana pods was OOM killed by just a simple HTTP GET to the kibana route or to the kibana svc:

Sep 30 11:38:17 node-3 kernel: oauth-proxy invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=968
Sep 30 11:38:17 node-3 kernel: oauth-proxy cpuset=docker-a65ce7857b306c3eef1d020828f8705adf9ef84c256d011ad0ea9725c881473c.scope mems_allowed=0
Sep 30 11:38:17 node-3 kernel: CPU: 3 PID: 43442 Comm: oauth-proxy Kdump: loaded Tainted: G        W      ------------ T 3.10.0-957.el7.x86_64 #1
Sep 30 11:38:17 node-3 kernel: Hardware name: Red Hat OpenStack Compute, BIOS 1.11.0-2.el7 04/01/2014
Sep 30 11:38:17 node-3 kernel: Call Trace:
Sep 30 11:38:17 node-3 kernel: [<ffffffff99d61dc1>] dump_stack+0x19/0x1b
Sep 30 11:38:17 node-3 kernel: [<ffffffff99d5c7ea>] dump_header+0x90/0x229
Sep 30 11:38:17 node-3 kernel: [<ffffffff997b9dc6>] ? find_lock_task_mm+0x56/0xc0
Sep 30 11:38:17 node-3 kernel: [<ffffffff997ba274>] oom_kill_process+0x254/0x3d0
Sep 30 11:38:17 node-3 kernel: [<ffffffff99900a2c>] ? selinux_capable+0x1c/0x40
Sep 30 11:38:17 node-3 kernel: [<ffffffff99834f16>] mem_cgroup_oom_synchronize+0x546/0x570
Sep 30 11:38:17 node-3 kernel: [<ffffffff99834390>] ? mem_cgroup_charge_common+0xc0/0xc0
Sep 30 11:38:17 node-3 kernel: [<ffffffff997bab04>] pagefault_out_of_memory+0x14/0x90
Sep 30 11:38:17 node-3 kernel: [<ffffffff99d5acf2>] mm_fault_error+0x6a/0x157
Sep 30 11:38:17 node-3 kernel: [<ffffffff99d6f7a8>] __do_page_fault+0x3c8/0x500
Sep 30 11:38:17 node-3 kernel: [<ffffffff99d6f9c6>] trace_do_page_fault+0x56/0x150
Sep 30 11:38:17 node-3 kernel: [<ffffffff99d6ef42>] do_async_page_fault+0x22/0xf0
Sep 30 11:38:17 node-3 kernel: [<ffffffff99d6b788>] async_page_fault+0x28/0x30
Sep 30 11:38:17 node-3 kernel: Task in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poddc5d931a_dfa3_11e9_aa40_fa163ea64b17.slice/docker-a65ce7857b306c3eef1d020828f8705adf9ef84c256d011ad0ea9725c881473c.scope killed as a result of limit of /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poddc5d931a_dfa3_11e9_aa40_fa163ea64b17.slice/docker-a65ce7857b306c3eef1d020828f8705adf9ef84c256d011ad0ea9725c881473c.scope
Sep 30 11:38:17 node-3 kernel: memory: usage 524288kB, limit 524288kB, failcnt 1834
Sep 30 11:38:17 node-3 kernel: memory+swap: usage 524288kB, limit 524288kB, failcnt 0
Sep 30 11:38:17 node-3 kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Sep 30 11:38:17 node-3 kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poddc5d931a_dfa3_11e9_aa40_fa163ea64b17.slice/docker-a65ce7857b306c3eef1d020828f8705adf9ef84c256d011ad0ea9725c881473c.scope: cache:1584KB rss:522704KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:522704KB inactive_file:844KB active_file:740KB unevictable:0KB
Sep 30 11:38:17 node-3 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Sep 30 11:38:17 node-3 kernel: [43367] 1000080000 43367   246541   132083     299        0           968 oauth-proxy
Sep 30 11:38:17 node-3 kernel: Memory cgroup out of memory: Kill process 13784 (oauth-proxy) score 1977 or sacrifice child
Sep 30 11:38:17 node-3 kernel: Killed process 43367 (oauth-proxy) total-vm:986164kB, anon-rss:521632kB, file-rss:6700kB, shmem-rss:0kB
Sep 30 11:38:18 node-3 atomic-openshift-node: I0930 11:38:18.444147     691 kubelet.go:1865] SyncLoop (PLEG): "logging-kibana-6-bx8qz_openshift-logging(dc5d931a-dfa3-11e9-aa40-fa163ea64b17)", event: &pleg.PodLifecycleEvent{ID:"dc5d931a-dfa3-11e9-aa40-fa163ea64b17", Type:"ContainerDied", Data:"a65ce7857b306c3eef1d020828f8705adf9ef84c256d011ad0ea9725c881473c"}
Sep 30 11:38:18 node-3 atomic-openshift-node: I0930 11:38:18.853633     691 kuberuntime_manager.go:513] Container {Name:kibana-proxy Image:registry.redhat.io/openshift3/oauth-proxy:v3.11.117 Command:[] Args:[--upstream-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt --https-address=:3000 -provider=openshift -client-id=kibana-proxy -client-secret-file=/secret/oauth-secret -cookie-secret-file=/secret/session-secret -upstream=http://localhost:5601 -scope=user:info user:check-access user:list-projects --tls-cert=/secret/server-cert --tls-key=/secret/server-key -pass-access-token -skip-provider-button] WorkingDir: Ports:[{Name:oaproxy HostPort:0 ContainerPort:3000 Protocol:TCP HostIP:}] EnvFrom:[] Env:[{Name:OAP_DEBUG Value:true ValueFrom:nil} {Name:OCP_AUTH_PROXY_MEMORY_LIMIT Value: ValueFrom:&EnvVarSource{FieldRef:nil,ResourceFieldRef:&ResourceFieldSelector{ContainerName:kibana-proxy,Resource:limits.memory,Divisor:0,},ConfigMapKeyRef:nil,SecretKeyRef:nil,}}] Resources:{Limits:map[memory:{i:{value:536870912 scale:0} d:{Dec:<nil>} s: Format:BinarySI}] Requests:map[cpu:{i:{value:100 scale:-3} d:{Dec:<nil>} s:100m Format:DecimalSI} memory:{i:{value:268435456 scale:0} d:{Dec:<nil>} s: Format:BinarySI}]} VolumeMounts:[{Name:kibana-proxy ReadOnly:true MountPath:/secret SubPath: MountPropagation:<nil>} {Name:aggregated-logging-kibana-token-pc9zf ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[],Drop:[KILL MKNOD SETGID SETUID],},Privileged:nil,SELinuxOptions:nil,RunAsUser:*1000080000,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Sep 30 11:38:18 node-3 atomic-openshift-node: I0930 11:38:18.854051     691 kuberuntime_manager.go:757] checking backoff for container "kibana-proxy" in pod "logging-kibana-6-bx8qz_openshift-logging(dc5d931a-dfa3-11e9-aa40-fa163ea64b17)"




Version-Release number of selected component (if applicable):
OCPv3.11.117 ; kibana oauth proxy image: registry.redhat.io/openshift3/oauth-proxy:v3.11.117


How reproducible:
always: while true; do curl <kibana route or kibana svc>; sleep 1; done


Steps to Reproduce:
1. while true; do curl <kibana route or kibana svc>; sleep 1; done
2.
3.


Actual results:
The kibana-proxy container was OOM killed after a few minutes; increasing the requests/limits only means it takes more time for the RAM to be exhausted.


Expected results:
The kibana-proxy container is not OOM killed and RAM is freed.

Additional info:

Comment 2 Anping Li 2019-11-08 11:04:19 UTC
The memory usage increased; could you check it?

Fri Nov  8 18:48:23 CST 2019
NAME                      CPU(cores)   MEMORY(bytes)   
kibana-598df8b964-gq5lm   140m         492Mi           
NAME                      READY   STATUS    RESTARTS   AGE
kibana-598df8b964-gq5lm   2/2     Running   0          10h
Fri Nov  8 18:48:49 CST 2019
NAME                      CPU(cores)   MEMORY(bytes)   
kibana-598df8b964-gq5lm   140m         540Mi           
NAME                      READY   STATUS    RESTARTS   AGE
kibana-598df8b964-gq5lm   2/2     Running   0          10h
NAME                      READY   STATUS    RESTARTS   AGE
kibana-598df8b964-gq5lm   2/2     Running   0          10h
Fri Nov  8 18:49:15 CST 2019
NAME                      CPU(cores)   MEMORY(bytes)   
kibana-598df8b964-gq5lm   136m         576Mi           
NAME                      READY   STATUS    RESTARTS   AGE
kibana-598df8b964-gq5lm   2/2     Running   0          10h
Fri Nov  8 18:49:24 CST 2019
NAME                      CPU(cores)   MEMORY(bytes)   
kibana-598df8b964-gq5lm   136m         576Mi           
NAME                      READY   STATUS    RESTARTS   AGE
NAME                      READY   STATUS    RESTARTS   AGE
kibana-598df8b964-gq5lm   2/2     Running   0          10h
Fri Nov  8 18:49:51 CST 2019
NAME                      CPU(cores)   MEMORY(bytes)   
kibana-598df8b964-gq5lm   137m         628Mi           
NAME                      READY   STATUS    RESTARTS   AGE
kibana-598df8b964-gq5lm   2/2     Running   0          10h
Fri Nov  8 18:50:17 CST 2019
NAME                      CPU(cores)   MEMORY(bytes)   
kibana-598df8b964-gq5lm   137m         673Mi           
NAME                      READY   STATUS    RESTARTS   AGE
kibana-598df8b964-gq5lm   2/2     Running   0          10h
NAME                      READY   STATUS    RESTARTS   AGE
kibana-598df8b964-gq5lm   2/2     Running   0          10h
Fri Nov  8 18:50:45 CST 2019
NAME                      CPU(cores)   MEMORY(bytes)   
kibana-598df8b964-gq5lm   137m         717Mi           
NAME                      READY   STATUS    RESTARTS   AGE
kibana-598df8b964-gq5lm   2/2     Running   0          10h
Fri Nov  8 18:50:55 CST 2019
NAME                      CPU(cores)   MEMORY(bytes)   
kibana-598df8b964-gq5lm   137m         717Mi           
NAME                      CPU(cores)   MEMORY(bytes)   
kibana-598df8b964-gq5lm   129m         717Mi           
NAME                      READY   STATUS    RESTARTS   AGE
kibana-598df8b964-gq5lm   2/2     Running   0          10h
Fri Nov  8 18:51:21 CST 2019
NAME                      CPU(cores)   MEMORY(bytes)   
kibana-598df8b964-gq5lm   141m         148Mi           
NAME                      READY   STATUS    RESTARTS   AGE
kibana-598df8b964-gq5lm   2/2     Running   0          10h
Fri Nov  8 18:51:30 CST 2019
NAME                      CPU(cores)   MEMORY(bytes)   
kibana-598df8b964-gq5lm   141m         148Mi           
NAME                      READY   STATUS    RESTARTS   AGE
kibana-598df8b964-gq5lm   2/2     Running   0          10h
Fri Nov  8 18:51:39 CST 2019
NAME                      CPU(cores)   MEMORY(bytes)   
kibana-598df8b964-gq5lm   131m         148Mi           
NAME                      READY   STATUS    RESTARTS   AGE
kibana-598df8b964-gq5lm   2/2     Running   0          10h

Comment 3 Standa Laznicka 2019-11-08 11:50:52 UTC
Could you please answer these questions so that I can understand what happened:

The memory increased when? When you ran the reproducer?

Is the first report already from when the reproducer ran? Which version of cluster did you use?

The bug was about a permanent memory increase with each request, but there is an obvious memory-usage drop in the reports from "Fri Nov  8 18:50:55 CST 2019" and "Fri Nov  8 18:51:21 CST 2019". I assume that is when you stopped the reproducer, which would mean the bug was fixed?

Comment 4 Anping Li 2019-11-12 06:54:18 UTC
Yes, when I ran the reproducer, the memory increased.
(In reply to Standa Laznicka from comment #3)
> Could you please answer these questions so that I can understand what
> happened:
> 
> The memory increased when? When you ran the reproducer?

Yes, when I ran the reproducer.
> 
> Is the first report already from when the reproducer ran? Which version of
> cluster did you use?
It is v4.3.
> 
> The bug was about permanent memory increase with each request but there is
> an obvious memory usage drop in reports from "Fri Nov  8 18:50:55 CST 2019"
> and "Fri Nov  8 18:51:21 CST 2019", which I assume is when you stopped the
> reproducer, which would therefore mean that the bug was fixed?

When the memory usage dropped, the reproducer was still running. There may be some improvement, but it seems that is not enough, as the 717Mi usage is close to the 760Mi memory limit. I have observed the pod being restarted.

Comment 5 Standa Laznicka 2019-11-12 13:15:58 UTC
Anping, can you also specify the version of oauth-proxy image you tested against?

I tested the oauth-proxy that is distributed with the 4.3 cluster and has the fix. I bombarded the container with some 2k requests in 10 seconds and did not see any unreasonable memory consumption increase, if any at all.

If you're indeed testing the fixed oauth-proxy version, can you add the --containers flag to `oc adm top` to see that it is actually the proxy which consumes the resources?

Comment 6 Anping Li 2019-11-13 07:09:23 UTC
Verified in v4.3.0-201911080552. The kibana pod wasn't restarted in 4 hours.

[43]$ head check.logs
Already on project "openshift-logging" on server "https://api.qe-anlia3.qe.gcp.devcluster.openshift.com:6443".
Tue Nov 12 10:01:46 EST 2019
NAME                      READY   STATUS    RESTARTS   AGE
kibana-766bf67db5-rcqj6   2/2     Running   0          8m8s
POD                       NAME           CPU(cores)   MEMORY(bytes)   
kibana-766bf67db5-rcqj6   kibana         8m           101Mi           
kibana-766bf67db5-rcqj6   kibana-proxy   0m           11Mi            
Tue Nov 12 10:01:52 EST 2019
NAME                      READY   STATUS    RESTARTS   AGE
kibana-766bf67db5-rcqj6   2/2     Running   0          8m14s

[anli@preserve-docker-slave 43]$ tail check.logs
kibana-766bf67db5-rcqj6   2/2     Running   0          16h
POD                       NAME           CPU(cores)   MEMORY(bytes)   
kibana-766bf67db5-rcqj6   kibana         10m          95Mi            
kibana-766bf67db5-rcqj6   kibana-proxy   24m          12Mi            
Wed Nov 13 02:07:52 EST 2019
NAME                      READY   STATUS    RESTARTS   AGE
kibana-766bf67db5-rcqj6   2/2     Running   0          16h
POD                       NAME           CPU(cores)   MEMORY(bytes)   
kibana-766bf67db5-rcqj6   kibana-proxy   24m          12Mi            
kibana-766bf67db5-rcqj6   kibana         10m          95Mi

Comment 8 errata-xmlrpc 2020-05-13 21:27:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

